[HN Gopher] It's always TCP_NODELAY
___________________________________________________________________
It's always TCP_NODELAY
Author : todsacerdoti
Score : 813 points
Date : 2024-05-09 17:54 UTC (1 days ago)
(HTM) web link (brooker.co.za)
(TXT) w3m dump (brooker.co.za)
| theamk wrote:
| I don't buy the reasoning for never needing Nagle anymore. Sure,
| telnet isn't a thing today, but I bet there are still plenty of
| apps which do the equivalent of:
|     write(fd, "Host: ")
|     write(fd, hostname)
|     write(fd, "\r\n")
|     write(fd, "Content-type: ")
|     etc...
|
| This may not be 40x overhead, but it'd still be 5x or so.
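|
| For illustration, a minimal sketch of the fix usually argued for
| here: build the fragments in userspace and issue one write(). The
| helper name, buffer size, and error handling are illustrative, not
| from the thread.
|
|     #include <stdio.h>
|     #include <unistd.h>
|
|     /* Coalesce the header line into one buffer: one syscall,
|        one segment, instead of three tinygrams. */
|     ssize_t send_host_header(int fd, const char *hostname)
|     {
|         char buf[512];
|         int n = snprintf(buf, sizeof(buf), "Host: %s\r\n", hostname);
|         if (n < 0 || (size_t)n >= sizeof(buf))
|             return -1;   /* hostname too long for this sketch */
|         return write(fd, buf, (size_t)n);
|     }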
| otterley wrote:
| Marc addresses that: "That's going to make some "write every
| byte" code slower than it would otherwise be, but those
| applications should be fixed anyway if we care about
| efficiency."
| Arnt wrote:
| Those aren't the ones you debug, so they won't be seen by OP.
| Those are the ones you don't need to debug because Nagle saves
| you.
| rwmj wrote:
| The comment about telnet had me wondering what openssh does,
| and it sets TCP_NODELAY on every connection, even for
| interactive sessions. (Confirmed by both reading the code and
| observing behaviour in 'strace').
| c0l0 wrote:
| _Especially_ for interactive sessions, it absolutely should!
| :)
| syncsynchalt wrote:
| Ironic since Nagle's Algorithm (which TCP_NODELAY disables)
| was invented for interactive sessions.
|
| It's hard to imagine interactive sessions making more than
| the tiniest of blips on a modern network.
| eru wrote:
| Isn't video calling an interactive session?
| semi wrote:
| I think that's more two independent byte streams. You
| want low latency, but what is transferred doesn't really
| affect the other side; you just constantly want to push
| the next frame.
| eru wrote:
| Thanks, that makes sense!
|
| It's interesting that it's very much an interactive
| experience for the end-user. But for the logic of the
| computer, it's not interactive at all.
|
| You can make the contrast even stronger: if both video
| streams are transmitted over UDP, you don't even need to
| send ACKs etc., making each stream truly one-directional
| from a technical point of view.
|
| Then compare that to transferring a file via TCP. For the
| user this is as one-directional and non-interactive as it
| gets, but the computers constantly talk back and forth.
| TorKlingberg wrote:
| Video calls indeed almost always use UDP. TCP
| retransmission isn't really useful since by the time a
| retransmitted packet arrives it's too old to display.
| Worse, a single lost packet will block a TCP stream.
| Sometimes TCP is the only way to get through a firewall,
| but the experience is bad if there's any packet loss at
| all.
|
| VC systems do constantly send back packet loss statistics
| and adjust the video quality to avoid saturating a link.
| Any buffering in routers along the way will add delay, so
| you want to keep the bitrate low enough to keep buffers
| empty.
| syncsynchalt wrote:
| You're right! (I'm ignoring the reply thread).
|
| I'm so used to a world where "interactive" was synonymous
| with "telnet" and "person on keyboard".
| temac wrote:
| Fix the apps. Nobody expects magical perf if you do that when
| writing to files, even though the OS also has its own buffers.
| There is no reason to expect otherwise when writing to a socket,
| and Nagle doesn't save you from syscall overhead anyway.
| toast0 wrote:
| Nagle doesn't save the derpy side from syscall overhead, but
| it would save the other side.
|
| It's not just apps doing this stuff, it also lives in system
| libraries. I'm still mad at the Android HTTPS library for
| sending chunked uploads as so many tinygrams. I don't
| remember exactly, but I think it's reasonable packetization
| for the data chunk (if it picked a reasonable size anyway),
| then one packet for \r\n, one for the size, and another for
| another \r\n. There's no reason for that, but it doesn't hurt
| the client enough that I can convince them to avoid the
| system library so they can fix it and the server can manage
| more throughput. Ugh. (It might be that it's just the TLS
| packetization that was this bogus and the TCP packetization
| was fine, it's been a while)
|
| If you take a pcap for some specific issue, there's always so
| many of these other terrible things in there. </rant>
| meinersbur wrote:
| Those are the apps that are quickly written and do not care if
| they unnecessarily congest the network. The ones that do get
| properly maintained can set TCP_NODELAY. Seems like a
| reasonable default to me.
| ale42 wrote:
| Apps can always misbehave, you never know what people
| implement, and you don't always have source code to patch. I
| don't think the role of the OS is to let the apps do whatever
| they wish, but it should give the possibility of doing it if
| it's needed. So I'd rather say, if you know you're properly
| doing things and you're latency sensitive, just TCP_NODELAY
| on all your sockets and you're fine, and nobody will blame
| you about doing it.
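|
| For reference, the per-socket opt-out being described here is a
| one-line setsockopt; a minimal sketch (the helper name is
| illustrative):
|
|     #include <netinet/in.h>
|     #include <netinet/tcp.h>
|     #include <sys/socket.h>
|
|     /* Disable Nagle's algorithm on an existing TCP socket.
|        Returns 0 on success, -1 on error (check errno). */
|     int set_nodelay(int fd)
|     {
|         int one = 1;
|         return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
|                           &one, sizeof(one));
|     }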
| bjourne wrote:
| > Fix the apps. Nobody expects magical perf if you do that
| when writing to files,
|
| We write to files line-by-line or even character-by-character
| and expect the library or OS to "magically" buffer it into
| fast file writes. Same with memory. We expect multiple small
| mallocs to be smartly coalesced by the platform.
| eru wrote:
| Yes, your libraries should fix that. The OS (as in the
| kernel) should not try to do any abstraction.
|
| Alas, kernels really like to offer abstractions.
| _carbyau_ wrote:
| True to a degree. But that is a singular platform wholly
| controlled by the OS.
|
| Once you put packets out into the world you're in a shared
| space.
|
| I assume every conceivable variation of argument has been
| made both for and against Nagles at this point but it
| essentially revolves around a shared networking resource
| and what policy is in place for fair use.
|
| Nagles fixes a particular case but interferes overall. If
| you fix the "particular case app" the issue goes away.
| PaulDavisThe1st wrote:
| If you _expect_ a POSIX-y OS to buffer write(2) calls, you're
| sadly misguided. Whether or not that happens depends on the
| nature of the device file you're writing to.
|
| OTOH, if you're using fwrite(3), as you likely should be for
| actual file I/O, then your expectation is entirely
| reasonable.
|
| Similarly with memory. If you expect brk(2) to handle
| multiple small allocations "sensibly" you're going to be
| disappointed. If you use malloc(3) then your expectation is
| entirely reasonable.
| bjourne wrote:
| Whether buffering is part of POSIX or not is beside the
| point. Any modern OS you'll find will buffer write calls
| in one way or the other. Similarly with memory: Linux
| waits until an access page-faults before reserving any
| memory pages for you. My point is that various forms of
| buffering are everywhere and in practice we rely on them
| a whole lot.
| PaulDavisThe1st wrote:
| > Any modern OS you'll find will buffer write calls in
| one way or the other.
|
| This is simply not true as a general rule. It depends on
| the nature of the file descriptor. Yes, if the file
| descriptor refers to the file system, it will in all
| likelihood be buffered by the OS (not with O_DIRECT,
| however). But on "any modern OS", file descriptors can
| refer to things that are not files, and the buffering
| situation there will vary from case to case.
| bjourne wrote:
| You're right, Linux does not buffer writes to file
| descriptors for which buffering has no performance
| benefit...
| blahgeek wrote:
| We actually have similar behavior when writing to files:
| contents are buffered in the page cache and are written to disk
| later in batches, unless the user explicitly calls "sync".
| jeroenhd wrote:
| Everybody expects magical perf if you do that when writing
| files. We have RAM buffers and write caches for a reason,
| even on fast SSDs. We expect it so much that macOS doesn't
| flush to disk even when you call fsync() (files get flushed
| to the disk's write buffer instead).
|
| There's some overhead to calling write() in a loop, but it's
| certainly not as bad as when a call to write() would actually
| make the data traverse whatever output stream you call it on.
| citrin_ru wrote:
| I agree that such code should be fixed, but I have a hard time
| persuading developers to fix their code. Many of them don't
| know what a syscall is, how making a syscall triggers sending
| of an IP packet, how a library call translates to a syscall,
| etc. Worse, they don't want to know this; they write, say,
| Java code (or some other high-level language) and argue that
| libraries/JDK/kernel should handle all the 'low level' stuff.
|
| To get optimal performance for request-response protocols
| like HTTP, one should send the full request, which includes the
| request line, all headers and a POST body, using a single
| write syscall (unless the POST body is large and it makes
| sense to write it in chunks). Unfortunately not all HTTP
| libraries work this way, and a library user cannot fix this
| problem without switching libraries, which is: 1. not always
| easy, 2. it is not widely known which libraries are efficient
| and which are not. Even if you have your own HTTP library,
| it's not always trivial to fix: e.g. in Java a way to fix this
| problem while keeping code readable and idiomatic is to wrap
| the socket in a BufferedOutputStream, which adds one more
| memory-to-memory copy for all data you are sending, on top of
| at least one memory-to-memory copy you already have without a
| buffered stream; so it's not an obvious performance win for
| an application which already saturates memory bandwidth.
| josefx wrote:
| I would love to fix the apps, can you point me to the github
| repo with all the code written the last 30 years so I can get
| started?
| grishka wrote:
| And they really shouldn't do this. Even disregarding the
| network aspect of it, this is still bad for performance because
| syscalls are kinda expensive.
| jrockway wrote:
| Does this matter? Yes, there's a lot of waste. But you also
| have a 1Gbps link. Every second that you don't use the full
| 1Gbps is also waste, right?
| tedunangst wrote:
| This is why I always pad out the end of my html files with a
| megabyte of . A half empty pipe is a half wasted pipe.
| dessimus wrote:
| Just be sure HTTP Compression is off though, or you're
| still half-wasting the pipe.
|
| Better to just dump randomized uncompressible data into
| html comments.
| arp242 wrote:
| I am finally starting to understand some of these
| OpenOffice/LibreOffice commit messages like
| https://github.com/LibreOffice/core/commit/a0b6744d3d77
| eatonphil wrote:
| I imagine the write calls show up pretty easily as a bottleneck
| in a flamegraph.
| wbl wrote:
| They don't. Maybe if you're really good you notice the higher
| overhead, but you expect to be spending time writing to the
| network. The actual impact shows up when bandwidth
| consumption is way up due to packet and TCP headers, which
| won't show on a flamegraph that easily.
| silisili wrote:
| We shouldn't penalize the internet at large because some
| developers write terrible code.
| littlestymaar wrote:
| Isn't it how SMTP is working though?
| leni536 wrote:
| No?
| loopdoend wrote:
| Ah yeah I fixed this exact bug in net-http in Ruby core a
| decade ago.
| tptacek wrote:
| The discussion here mostly seems to miss the point. The
| argument is to _change the default_ , not to eliminate the
| behavior altogether.
| the8472 wrote:
| Shouldn't autocorking help even without Nagle?
| asveikau wrote:
| I don't think that's actually super common anymore when you
| consider that with asynchronous I/O, the only sane way to do
| it is to put the data into a buffer rather than blocking at
| every small write(2).
|
| Then consider that asynchronous I/O is usually necessary
| both on the server (otherwise you don't scale well) and the
| client (because blocking on network calls is a terrible
| experience, especially in today's world of frequent network
| changes, falling out of network range, etc.)
| sophacles wrote:
| TCP_CORK handles this better than Nagle tho.
| jabl wrote:
| Even if you do nothing 'fancy' like Nagle, corking, or
| userspace building up the complete buffer before writing etc.,
| at the very least the above should be using a vectored write
| (writev() ).
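|
| A minimal sketch of that vectored-write approach (function name
| and parameters are illustrative; writev is POSIX, so it's at
| least as portable as send on Unix-like systems):
|
|     #include <sys/types.h>
|     #include <sys/uio.h>
|
|     /* Hand the request head and body to the kernel in one
|        syscall instead of two small writes. A short write
|        still needs a retry loop in real code. */
|     ssize_t send_request(int fd, const void *head, size_t head_len,
|                          const void *body, size_t body_len)
|     {
|         struct iovec iov[2] = {
|             { .iov_base = (void *)head, .iov_len = head_len },
|             { .iov_base = (void *)body, .iov_len = body_len },
|         };
|         return writev(fd, iov, 2);
|     }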
| Too wrote:
| Shouldn't that go through some buffer? Unless you fflush()
| between each write?
| mannyv wrote:
| We used to call them "packlets."
|
| His "tinygrams" is pretty good too, but that sort of implies UDP
| (D -> datagrams)
| chuckadams wrote:
| > We used to call them "packlets."
|
| setsockopt(fd, IPPROTO_TCP, TCP_MAKE_IT_GO, &go, sizeof(go));
| obelos wrote:
| Not every time. Sometimes it's DNS.
| jeffrallen wrote:
| Once every 50 years and 2 billion kilometers, it's a failing
| memory chip. But you can usually just patch around them, so no
| big deal.
| skunkworker wrote:
| Don't forget BGP or running out of disk space without an alert.
| p_l wrote:
| Once it was a failing line card in a router zeroing the last
| bit of IPv4 addresses, resulting in a ticket about "only even
| IPv4 addresses are accessible" ...
| jcgrillo wrote:
| For some reason this reminded me of the "500mi email" bug
| [1], maybe a similar level of initial apparent absurdity?
|
| [1] https://www.ibiblio.org/harris/500milemail.html
| chuckadams wrote:
| The most absurd thing to me about the 500 mile email
| situation is that sendmail just happily started up and
| soldiered on after being given a completely alien config
| file. Could be read as another example of "be liberal in
| what you accept" going awry, but sendmail's wretched config
| format is really a volume of war stories all its own...
| jcgrillo wrote:
| Configuration changes are one of those areas where having
| some kind of "are you sure? (y/n)" check can really pay
| off. It wouldn't have helped in this case, because there
| wasn't really any change management process to speak of,
| but we haven't fully learned the lesson yet.
| unconed wrote:
| Confirmations are mostly useless unless you explicitly
| spell out the implications of the change. They are also
| inferior to being able to undo changes.
|
| That's a lesson many don't know.
| lanstin wrote:
| Your time from commit to live is proportional to your
| time to roll back to a known good state. Maybe to a power
| of the rollback time.
| rincebrain wrote:
| My favorite example of that was a while ago, "vixie-cron
| will read a cron stanza from a core dump written to
| /etc/cron.d" when you could convince it to write a core
| dump there. The other crons wouldn't touch that, but
| vixie-cron happily chomped through the core dump for "* *
| * * * root chmod u+s /tmp/uhoh" etc.
| p_l wrote:
| I can definitely confirm our initial reaction was "WTF",
| followed by the idea that the dev team was making fun of us...
| but we went in and ran traceroutes and there it was :O
|
| It was fixed in an incredible coincidence, too - the _CTO_
| of the network link provider was in their offices (in the
| same building as me) and felt bored. Apparently having gone
| through all the levels from hauling cables in a datacenter up
| to CTO, after a short look at the traceroutes he just picked
| up a phone, called the NOC, and ordered a line card
| replacement on the router :D
| marcosdumay wrote:
| When it fails, it's DNS. When it just stops moving, it's either
| TCP_NODELAY or stream buffering.
|
| Really complex systems (the Web) also fail because of caching.
| drivers99 wrote:
| Or SELinux
| rickydroll wrote:
| Not every time. Sometimes, the power cord is only connected at
| one end.
| sophacles wrote:
| One time for me it was: the glass was dirty.
|
| Some router near a construction site had dust settle into the
| gap between the laser and the fiber, and it attenuated the
| signal enough to see 40-50% packet loss.
|
| We figured out where the loss was and had our NOC email the
| relevant transit provider. A day later we got an email back
| from the tech they dispatched with the story.
| Sohcahtoa82 wrote:
| I chuckle whenever I see this meme, because in my experience,
| the issue is usually DHCP.
| anilakar wrote:
| But it's usually DHCP that sets the wrong DNS servers.
|
| It's funny that some folks claim a DNS outage is a legitimate
| issue in systems where they control both ends. I get it;
| reimplementing functionality is rarely a good sign, but since
| you already know your own addresses in the first place, you
| should also have an internal mechanism for sharing them.
| batmanthehorse wrote:
| Does anyone know of a good way to enable TCP_NODELAY on sockets
| when you don't have access to the source for that application? I
| can't find any kernel settings to make it permanent, or commands
| to change it after the fact.
|
| I've been able to disable delayed acks using `quickack 1` in the
| routing table, but it seems particularly hard to enable
| TCP_NODELAY from outside the application.
|
| I've been having exactly the problem described here lately, when
| communicating between an application I own and a closed source
| application it interacts with.
| tedunangst wrote:
| LD_PRELOAD.
| batmanthehorse wrote:
| Thank you, found this: https://github.com/sschroe/libnodelay
| coldpie wrote:
| Would some kind of LD_PRELOAD interception for socket(2) work?
| Call the real function, then do setsockopt or whatever, and
| return the modified socket.
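|
| A minimal sketch of that kind of shim, along the lines of
| libnodelay linked above. The file and library names are
| illustrative, and (as noted below) this only helps when the app
| calls socket() through a dynamically linked libc.
|
|     /* nodelay_preload.c -- build and use, roughly:
|        gcc -shared -fPIC -o libforce_nodelay.so nodelay_preload.c -ldl
|        LD_PRELOAD=./libforce_nodelay.so ./closed_source_app       */
|     #define _GNU_SOURCE
|     #include <dlfcn.h>
|     #include <netinet/in.h>
|     #include <netinet/tcp.h>
|     #include <sys/socket.h>
|
|     int socket(int domain, int type, int protocol)
|     {
|         static int (*real_socket)(int, int, int);
|         if (!real_socket)
|             real_socket = (int (*)(int, int, int))
|                 dlsym(RTLD_NEXT, "socket");
|
|         int fd = real_socket(domain, type, protocol);
|
|         /* Only touch TCP stream sockets; ignore setsockopt
|            errors so the wrapped app never breaks. */
|         int base_type = type & ~(SOCK_NONBLOCK | SOCK_CLOEXEC);
|         if (fd >= 0 && (domain == AF_INET || domain == AF_INET6) &&
|             base_type == SOCK_STREAM) {
|             int one = 1;
|             setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
|         }
|         return fd;
|     }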
| cesarb wrote:
| > Would some kind of LD_PRELOAD interception for socket(2)
| work?
|
| That would only work if the call goes through libc, and it's
| not statically linked. However, it's becoming more and more
| common to do system calls directly, bypassing libc; the Go
| language is infamous for doing that, but there's also things
| like the rustix crate for Rust
| (https://crates.io/crates/rustix), which does direct system
| calls by default.
| zbowling wrote:
| And Go is wrong for doing that, at least on Linux. It
| bypasses optimizations in the vDSO in some cases. On
| Fuchsia, we made direct syscalls that don't go through the
| vDSO illegal, and it was funny the hacks Go required. The
| system ABI of Linux really isn't the syscall interface, it's
| the system libc. That's because the C ABI (and the
| behaviors of the triple it was compiled for) and its isms
| for that platform are the lingua franca of that system.
| Going around that to call syscalls directly, at least for
| the 90% of useful syscalls on the system that are wrapped
| by libc, is asinine and creates odd bugs, and makes crash
| reporters, heuristic unwinders, debuggers, etc. all more
| painful to write. It also prevents the system vendor from
| implementing user-mode optimizations that avoid mode and
| context switches when necessary. We tried to solve these
| issues in Fuchsia, but for Linux, Darwin, and hell, even
| Windows, if you are making direct syscalls and it's not for
| something really special and bespoke, you are just flat-out
| wrong.
| JoshTriplett wrote:
| > The system ABI of Linux really isn't the syscall
| interface, its the system libc.
|
| You might have reasons to prefer to use libc; some
| software has reason to not use libc. Those preferences
| are in conflict, but one of them is not automatically
| right and the other wrong in all circumstances.
|
| Many UNIX systems _did_ follow the premise that you
| _must_ use libc and the syscall interface is unstable.
| Linux pointedly did not, and decided to have a stable
| syscall ABI instead. This means it's possible to have
| multiple C libraries, as well as other libraries, which
| have different needs or goals and interface with the
| system differently. That's a _useful_ property of Linux.
|
| There are a couple of established mechanisms on Linux for
| intercepting syscalls: ptrace, and BPF. If you want to
| intercept all uses of a syscall, intercept the syscall.
| If you want to intercept a particular glibc function _in
| programs using glibc_ , or for that matter a musl
| function in a program using musl, go ahead and use
| LD_PRELOAD. But the Linux syscall interface is a valid
| and stable interface to the system, and that's why
| LD_PRELOAD is not a complete solution.
| zbowling wrote:
| It's true that Linux has a stable-ish syscall table. What
| is funny is that this caused a whole series of Samsung
| Android phones to reboot randomly with some apps, because
| Samsung added a syscall at the same position someone else
| did in upstream Linux, and folks statically linking their
| own libc to avoid Bionic libc were rebooting phones when
| calling certain functions, because the Samsung syscall
| caused kernel panics when called wrong. It goes back to it
| being a bad idea to subvert your system libc. Now, distro
| vendors do give out multiple versions of a libc that all
| work with your kernel. This generally works. When we had
| to fix ABI issues, this happened a few times. But I
| wouldn't trust building our own libc and assuming that libc
| is portable to any Linux machine you copy it to.
| cesarb wrote:
| > It's true that Linux has a stable-ish syscall table.
|
| It's not "stable-ish", it's fully stable. Once a syscall
| is added to the syscall table on a released version of
| the official Linux kernel, it might later be replaced by
| a "not implemented" stub (which always returns -ENOSYS),
| but it will never be reused for anything else. There's
| even reserved space on some architectures for the STREAMS
| syscalls, which were AFAIK never on any released version
| of the Linux kernel.
|
| The exception is when creating a new architecture; for
| instance, the syscall table for 32-bit x86 and 64-bit x86
| has a completely different order.
| withinboredom wrote:
| I think what they meant (judging by the example you
| ignored) is that the table changes (even if append-only)
| and you don't know which version you actually have when
| you statically compile your own version. Thus, your
| syscalls might be using a newer version of the table, but
| the call may a) not actually be implemented, or b) be
| implemented with something bespoke.
| cesarb wrote:
| > Thus, your syscalls might be using a newer version of
| the table but it a) not actually be implemented,
|
| That's the same case as when a syscall is later removed:
| it returns -ENOSYS. The correct way is to do the call
| normally as if it were implemented, and if it returns
| -ENOSYS, you know that this syscall does not exist in the
| currently running kernel, and you should try something
| else. That is the same no matter whether it's compiled
| statically or dynamically; even a dynamic glibc has
| fallback paths for some missing syscalls (glibc has a
| minimum required kernel version, so it does not need to
| have fallback paths for features introduced a long time
| ago).
|
| > or b) implemented with something bespoke.
|
| There's nothing you can do to protect against a modified
| kernel which does something different from the upstream
| Linux kernel. Even going through libc doesn't help, since
| whoever modified the Linux kernel to do something
| unexpected could also have modified the C library to do
| something unexpected, or libc could trip over the
| unexpected kernel changes.
|
| One example of this happening is with seccomp filters.
| They can be used to make a syscall fail with an
| unexpected error code, and this can confuse the C
| library. More specifically, a seccomp filter which forces
| the clone3 syscall to always return -EPERM breaks newer
| libc versions which try the clone3 syscall first, and
| then fallback to the older clone syscall if clone3
| returned -ENOSYS (which indicates an older kernel that
| does not have the clone3 syscall); this breaks for
| instance running newer Linux distributions within older
| Docker versions.
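|
| A minimal sketch of the probe-and-fall-back pattern described
| here; pidfd_open is just an example of a "newer" syscall, and the
| helper name is illustrative:
|
|     #include <errno.h>
|     #include <sys/syscall.h>
|     #include <unistd.h>
|
|     /* Try the newer syscall; if the running kernel reports
|        ENOSYS, the caller should use an older mechanism. */
|     int try_pidfd_open(pid_t pid)
|     {
|     #ifdef SYS_pidfd_open
|         long fd = syscall(SYS_pidfd_open, pid, 0);
|         if (fd >= 0 || errno != ENOSYS)
|             return (int)fd;   /* success, or a real error */
|     #endif
|         return -1;            /* kernel (or headers) too old */
|     }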
| withinboredom wrote:
| Every kernel I've ever used has been different from an
| upstream kernel, with custom patches applied. It's
| literally open source, anyone can do anything to it that
| they want. If you are using libc, you'd have a reasonable
| expectation not to need to know the details of those
| changes. If you call the kernel directly via syscall,
| then yeah, there is nothing you can do about someone
| making modifications to open source software.
| tedunangst wrote:
| The complication with the Linux syscall interface is that
| it turns "worse is better" up to 11. Like setuid works on a
| per-thread basis, which is seriously not what you want, so
| every program/runtime must do this fun little thread
| stop-and-start-and-thunk dance.
| JoshTriplett wrote:
| Yeah, agreed. One of the items on my _long_ TODO list is
| adding `setuid_process` and `setgid_process` and similar,
| so that perhaps a decade later when new runtimes can
| count on the presence of those syscalls, they can stop
| duplicating that mechanism in userspace.
| toast0 wrote:
| > The system ABI of Linux really isn't the syscall
| interface, its the system libc.
|
| Which one? The Linux Kernel doesn't provide a libc. What
| if you're a static executable?
|
| Even on Operating Systems with a libc provided by the
| kernel, it's almost always allowed to upgrade the kernel
| without upgrading the userland (including libc); that
| works because the interface between userland and kernel
| is syscalls.
|
| That certainly ties something that makes syscalls to a
| narrow range of kernel versions, but it's not as if
| dynamically linking libc means your program will be
| compatible forever either.
| jimmaswell wrote:
| > That certainly ties something that makes syscalls to a
| narrow range of kernel versions
|
| I don't think that's right, wouldn't it be the earliest
| kernel supporting that call and onwards? The Linux ABI
| intentionally never breaks userland.
| toast0 wrote:
| In the case where you're running an Operating System that
| provides a libc and is OK with removing older syscalls,
| there's a beginning and an end to support.
|
| Looking at FreeBSD under /usr/include/sys/syscall.h,
| there's a good number of retired syscalls.
|
| On Linux under /usr/include/x86_64-linux-
| gnu/asm/unistd_32.h I see a fair number of missing
| numbers --- not sure what those are about, but 222, 223,
| 251, 285, and 387-392 are missing. (on Debian 12.1 with
| linux-image-6.1.0-12-amd64 version 6.1.52-1, if it
| matters)
| assassinator42 wrote:
| The proliferation of Docker containers seems to go
| against that. Those really only work well since the
| kernel has a stable syscall ABI. So much so that you see
| Microsoft switching to a stable syscall ABI with Windows
| 11.
| sophacles wrote:
| Linux is also weird because there are syscalls not
| supported in most (any?) libc - things like io_uring, and
| netlink fall into this.
| gpderetta wrote:
| Futex for a very long time was only accessible via
| syscall.
| Thaxll wrote:
| Those are very strong words...
| leni536 wrote:
| It should be possible to use vDSO without libc, although
| probably a lot of work.
| LegionMammal978 wrote:
| It's not that much work; after all, every libc needs to
| have its own implementation. The kernel maps the vDSO
| into memory for you, and gives you the base address as an
| entry in the auxiliary vector.
|
| But using it does require some basic knowledge of the ELF
| format on the current platform, in order to parse the
| symbol table. (Alongside knowledge of which functions are
| available in the first place.)
| intelVISA wrote:
| It's hard work to NOT have the damn vDSO invade your
| address space. The only kludgy part of Linux, well, apart
| from Nagle's, dlopen, and that weird zero-copy kernel
| patch that mmap'd -each- socket recv(!) for a while.
| pie_flavor wrote:
| You seem to be saying 'it was incorrect on Fuchsia, so
| it's incorrect on Linux'. No, it's correct on Linux, and
| incorrect on every other platform, as each platform's
| documentation is very clear on. Go did it incorrectly on
| FreeBSD, but that's Go being Go; they did it in the first
| place because it's a Linux-first system and it's correct
| on Linux. And glibc does not have any special privilege,
| the vdso optimizations it takes advantage of are just as
| easily taken advantage of by the Go compiler. There's no
| reason to bucket Linux with Windows on the subject of
| syscalls when the Linux manpages are very clear that
| syscalls are there to be used and exhaustively documents
| them, while MSDN is very clear that the system interface
| is kernel32.dll and ntdll.dll, and shuffles the syscall
| numbers every so often so you don't get any funny ideas.
| asveikau wrote:
| Linux doesn't even have consensus on what libc to use,
| and ABI breakage between glibc and musl is not unheard
| of. (Probably not for syscalls but for other things.)
| LegionMammal978 wrote:
| > And go is wrong for doing that, at least on Linux. It
| bypasses optimizations in the vDSO in some cases.
|
| Go's runtime _does_ go through the vDSO for syscalls that
| support it, though (e.g., [0]). Of course, it won't
| magically adapt to new functions added in later kernel
| versions, but neither will a statically-linked libc. And
| it's not like it's a regular occurrence for Linux to add
| new functions to the vDSO, in any case.
|
| [0] https://github.com/golang/go/blob/master/src/runtime/
| time_li...
| praptak wrote:
| Attach debugger (ptrace), call setsockopt?
| the8472 wrote:
| opening `/proc/<pid>/fd/<fd number>` and setting the socket
| option may work (not tested)
| tuetuopay wrote:
| you could try ebpf and hook on the socket syscall. might be
| harder than LD_PRELOAD as suggested by other commenters though
| jdadj wrote:
| Depending on the specifics, you might be able to add socat in
| the middle.
|
| Instead of: your_app --> server
|
| you'd have: your_app -> localhost_socat -> server
|
| socat has command line options for setting tcp_nodelay. You'd
| need to convince your closed source app to connect to
| localhost, though. But if it's doing a dns lookup, you could
| probably convince it to connect to localhost with an /etc/hosts
| entry
|
| Since your app would be talking to socat over a local socket,
| the app's tcp_nodelay wouldn't have any effect.
| Too wrote:
| Is it possible to set it as a global OS setting, inside a
| container?
| mirekrusin wrote:
| Can't it have an "if the payload is 1 byte (or less than X)
| then wait, otherwise don't" condition?
| chuckadams wrote:
| Some network stacks like those in Solaris and HP/UX let you
| tune the "Nagle limit" in just such a fashion, up to disabling
| it entirely by setting it to 1. I'm not aware of it being
| tunable on Linux, though you can manually control the buffering
| using TCP_CORK. https://baus.net/on-tcp_cork/ has some nice
| details.
| fweimer wrote:
| There is a socket option, SO_SNDLOWAT. It's not implemented on
| Linux according to the manual page. The descriptions in UNIX
| Network Programming and TCP/IP Illustrated conflict, too. So
| it's probably not useful.
| the8472 wrote:
| You can buffer in userspace. Don't do small writes to the
| socket and no bytes will be sent. Don't do two consecutive
| small writes and nagle won't kick in.
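|
| A minimal sketch of that userspace buffering (struct and function
| names are illustrative; a real implementation would also loop on
| short writes and handle EINTR/EAGAIN):
|
|     #include <string.h>
|     #include <unistd.h>
|
|     struct outbuf { char data[4096]; size_t used; };
|
|     /* Append a small piece; the caller flushes when a whole
|        message has been assembled. */
|     int outbuf_append(struct outbuf *b, const void *p, size_t n)
|     {
|         if (n > sizeof(b->data) - b->used)
|             return -1;   /* flush first (or grow the buffer) */
|         memcpy(b->data + b->used, p, n);
|         b->used += n;
|         return 0;
|     }
|
|     /* One write per message instead of one per fragment. */
|     ssize_t outbuf_flush(int fd, struct outbuf *b)
|     {
|         ssize_t w = write(fd, b->data, b->used);
|         if (w > 0) {
|             memmove(b->data, b->data + w, b->used - (size_t)w);
|             b->used -= (size_t)w;
|         }
|         return w;
|     }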
| astrange wrote:
| FreeBSD has accept filters, which let you do something like
| wait for a complete HTTP header (an inaccurate, from-memory
| summary). Not sure about the sending side.
| deathanatos wrote:
| How is what you're describing not just Nagle's algorithm?
|
| If you mean TCP_NODELAY, you should use it with TCP_CORK, which
| prevents partial frames. TCP_CORK the socket, do your writes to
| the kernel via send, and then once you have an application
| level "message" ready to send out -- i.e., once you're at the
| point where you're going to go to sleep and wait for the other
| end to respond, unset TCP_CORK & then go back to your event
| loop & sleep. The "uncork" at the end + nodelay sends the final
| partial frame, if there is one.
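|
| A minimal sketch of that cork/write/uncork pattern on Linux,
| assuming TCP_NODELAY is also set on the socket; names are
| illustrative and error handling is omitted for brevity:
|
|     #include <netinet/in.h>
|     #include <netinet/tcp.h>
|     #include <sys/socket.h>
|     #include <unistd.h>
|
|     void send_message(int fd, const void *hdr, size_t hdr_len,
|                       const void *body, size_t body_len)
|     {
|         int on = 1, off = 0;
|
|         /* Hold partial frames while the message is assembled. */
|         setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
|         write(fd, hdr, hdr_len);
|         write(fd, body, body_len);
|         /* Uncork: flush whatever partial frame is left. */
|         setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
|     }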
| elhosots wrote:
| This sounds like the root of the vncviewer/server interaction
| bugs I experience with some VNC viewer/server combos between
| Ubuntu Linux and FreeBSD... (tight/tiger)
| evanelias wrote:
| John Nagle has posted insightful comments about the historical
| background for this many times, for example
| https://news.ycombinator.com/item?id=9048947 referenced in the
| article. He's a prolific HN commenter (#11 on the leaderboard) so
| it can be hard to find everything, but some more comments
| searchable via
| https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
| or
| https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
| Animats wrote:
| The sending pattern matters. Send/Receive/Send/Receive won't
| trigger the problem, because the request will go out
| immediately and the reply will provide an ACK and allow another
| request. Bulk transfers won't cause the problem, because if you
| fill the outgoing block size, there's no delay.
|
| But Send/Send/Receive will. This comes up a lot in game
| systems, where most of the traffic is small events going one
| way.
| pipe01 wrote:
| I would imagine that games that require exotic sending
| patterns would use UDP, giving them more control over the
| protocol
| codr7 wrote:
| Size prefixed messages are pretty common, perfectly
| possible to perform as one send but takes more work.
| EvanAnderson wrote:
| I love it when Nagle's algorithm comes up on HN. Inevitably
| someone, not knowing "Animats" is John Nagle, responds to a
| comment from Animats with a "knowing better" tone. >smile<
|
| (I also really like Animats' comments, too.)
| geoelectric wrote:
| I have to confess that when I saw this post, I quickly
| skimmed the threads to check if someone was trying to educate
| Animats on TCP. Think I've only seen that happen in the wild
| once or twice, but it absolutely made my day when it did.
| ryandrake wrote:
| It's always the highlight of my day when it happens, almost
| as nice as when someone chimes in to educate John Carmack
| on 3D graphics and VR technology.
| userbinator wrote:
| I always check if the man himself makes an appearance every
| time I see that. He has posted a few comments in here
| already.
| jeltz wrote:
| It is like when someone here accused Andres Freund
| (PostgreSQL core dev who recently became famous due to the xz
| backdoor) of Dunning-Kruger when he had commented on
| something related to PostgreSQL's architecture which he had
| spent many many hours working on personally (I think it was
| pluggable storage).
|
| Maybe you just tried to educate the leading expert in the
| world on his own expertise. :D
| SushiHippie wrote:
| FYI the best way to filter by author is 'author:Animats' this
| will only show results from the user Animats and won't match
| animats inside the comment text.
|
| https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
| pclmulqdq wrote:
| In a world where bandwidth was limited, and the packet size
| minimum was 64 bytes plus an inter-frame gap (it still is for
| most Ethernet networks), sending a TCP packet for literally every
| byte wasted a huge amount of bandwidth. The same goes for sending
| empty acks.
|
| On the other hand, my general position is: it's not TCP_NODELAY,
| it's TCP.
| metadaemon wrote:
| I'd just love a protocol that has a built in mechanism for
| realizing the other side of the pipe disconnected for any
| reason.
| koverstreet wrote:
| Like TCP keepalives?
| mort96 wrote:
| If the feature already technically exists in TCP, it's
| either broken or disabled by default, which is pretty much
| the same as not having it.
| voxic11 wrote:
| Keepalives are an optional TCP feature, so they are not
| necessarily supported by all TCP implementations and
| therefore default to off even when supported.
| dilyevsky wrote:
| Where is it off? Most Linux distros have it on; it's just that
| the default kickoff timer is ridiculously long (like 2
| hours iirc). Besides, TCP keepalives won't help with the
| issue at hand and were put in for a totally different
| purpose (gc'ing idle connections). Most of the time you
| don't even need them because the other side will send an RST
| packet if it already closed the socket.
| halter73 wrote:
| AFAIK, all Linux distros plus Windows and macOS have TCP
| keepalives off by default, as mandated by RFC 1122.
| Even when they are optionally turned on using
| SO_KEEPALIVE, the interval defaults to two hours because
| that is the minimum default interval allowed by the spec.
| That can then be optionally reduced with something like
| /proc/sys/net/ipv4/tcp_keepalive_time (system wide) or
| TCP_KEEPIDLE (per socket).
|
| By default, completely idle TCP connections will stay
| alive indefinitely from the perspective of both peers,
| even if their physical connection is severed.
|
|     Implementors MAY include "keep-alives" in their TCP
|     implementations, although this practice is not
|     universally accepted. If keep-alives are included, the
|     application MUST be able to turn them on or off for
|     each TCP connection, and they MUST default to off.
|     Keep-alive packets MUST only be sent when no data or
|     acknowledgement packets have been received for the
|     connection within an interval. This interval MUST be
|     configurable and MUST default to no less than two
|     hours.
|
| [0]:
| https://datatracker.ietf.org/doc/html/rfc1122#page-101
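|
| A minimal sketch of turning keepalives on and tightening those
| intervals per socket, using the Linux option names mentioned
| above; the timing values are illustrative:
|
|     #include <netinet/in.h>
|     #include <netinet/tcp.h>
|     #include <sys/socket.h>
|
|     void enable_keepalive(int fd)
|     {
|         int on = 1, idle = 60, interval = 10, count = 5;
|
|         /* Off by default per RFC 1122, so opt in explicitly. */
|         setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
|         /* Seconds of idle time before the first probe. */
|         setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
|         /* Seconds between probes. */
|         setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
|                    &interval, sizeof(interval));
|         /* Failed probes before the connection is declared dead. */
|         setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
|     }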
| dilyevsky wrote:
| OK you're right - it's coming back to me now. I've been
| spoiled by software that enables keep-alive on sockets.
| mort96 wrote:
| So we need a protocol with some kind of non-optional
| default-enabled keepalive.
| josefx wrote:
| Now your connections start to randomly fail in production
| because the implementation defaults to 20ms and your
| local tests never caught that.
| mort96 wrote:
| I'm sure there's some middle ground between "never time
| out" and "time out after 20ms" that works reasonably well
| for most use cases
| hi-v-rocknroll wrote:
| You're conflating all optional TCP features of all
| operating systems, network devices, and RFCs together.
| This lack of nuance fails to appreciate that different
| applications have different needs for how they use TCP: (
| server | client ) x ( one way | chatty bidirectional |
| idle tinygram | mixed ). If a feature needs to be used on
| a particular connection, then use it. ;)
| the8472 wrote:
| If a socket is closed properly there'll be a FIN and the
| other side can learn about it by polling the socket.
|
| If the network connection is lost due to external
| circumstances (say your modem crashes) then how would that
| information propagate from the point of failure to the remote
| end _on an idle connection_? Either you actively probe
| (keepalives) and risk false positives or you wait until you
| hear again from the other side, risking false negatives.
| sophacles wrote:
| It gets even worse - routing changes causing traffic to
| blackhole would still be undetectable without a timeout
| mechanism, since probes and responses would be lost.
| dataflow wrote:
| > If the network connection is lost due to external
| circumstances (say your modem crashes) then how would that
| information propagate from the point of failure to the
| remote end _on an idle connection_?
|
| Observe the line voltage? If it gets cut then you have a
| problem...
|
| > Either you actively probe (keepalives) and risk false
| positives
|
| What false positives? Are you thinking there's an adversary
| on the other side?
| pjc50 wrote:
| This is a L2 vs L3 thing.
|
| Most network links absolutely will detect that the link
| has gone away; the little LED will turn off and the OS
| will be informed on both ends of that link.
|
| But one of the link ends is a router, and these are
| (except for NAT) _stateless_. The router _does not know_
| what TCP connections are currently running through it, so
| it cannot notify them - until a packet for that link
| arrives, at which point it can send back an ICMP packet.
|
| A TCP link with no traffic on it _does not exist_ on the
| intermediate routers.
|
| (Direct contrast to the old telecom ATM protocol, which
| was circuit switched and required "reservation" of a full
| set of end-to-end links).
| ncruces wrote:
| For a given connection, (most) packets might go through
| (e.g.) 10 links. If one link goes down (or is saturated
| and dropping packets) the connection is supposed to route
| around it.
|
| So, except for the links on either end going down (one
| end really, if the other is in a "data center" the TCP
| connection is likely terminated in a "server" with
| redundant networking) you wouldn't want to have a
| connection terminated just because a link died.
|
| That's explicitly against the goal of a packet-switched
| network.
| toast0 wrote:
| That's possible in circuit switched networking with various
| types of supervision, but packet switched networking has
| taken over because it's much less expensive to implement.
|
| Attempts to add connection monitoring usually make things
| worse --- if you need to reroute a cable, and one or both
| ends of the cable will detect a cable disconnection and close
| user sockets, that's not great: instead of a quick change
| with a small period of data loss but an otherwise minor
| interruption, all of the established connections will be
| dropped.
| noselasd wrote:
| SCTP has heartbeats to detect that.
| sophacles wrote:
| That's really really hard. For a full, guaranteed way to do
| this we'd need circuit switching (or circuit switching
| emulation). It's pretty expensive to do in packet networks -
| each flow would need to be tracked by each middle box, so a
| lot more RAM at every hop, and probably a lot more processing
| power. If we go with circuit establishment, it's also kind of
| expensive and breaks the whole "distributed, decentralized,
| self-healing network" property of the Internet.
|
| It's possible to do better than TCP these days; bandwidth is
| much, much less constrained than it was when TCP was designed.
| But it's still a hard problem to detect a disconnected pipe
| for _any_ reason other than via timeouts (which we already
| have).
| pclmulqdq wrote:
| Several of the "reliable UDP" protocols I have worked on in
| the past have had a heartbeat mechanism that is specifically
| for detecting this. If you haven't sent a packet down the
| wire in 10-100 milliseconds, you will send an extra packet
| just to say you're still there.
|
| It's very useful to do this in intra-datacenter protocols.
| 01HNNWZ0MV43FF wrote:
| To re-word everyone else's comments - "Disconnected" is not
| well-defined in any network.
| dataflow wrote:
| > To re-word everyone else's comments - "Disconnected" is
| not well-defined in any network.
|
| Parent said disconnected pipe, not network. It's
| sufficiently well-definable there.
| Spivak wrote:
| I think it's a distinction without a difference in this
| case. You can't know if the reason your water stopped is
| because the water is shut off, the pipe broke, or it's
| just slow.
|
| When all you have to go on is "I stopped getting packets"
| the best you can do is give up after a bit. TCP
| keepalives do kinda suck and are full of interesting
| choices that don't seem to have passed the test of time.
| But they are there, and if you control both sides of the
| connection you can be sure they work.
| dataflow wrote:
| There's a crucial difference in fact, which is that the
| peer you're defining connectedness to is a single well-
| defined peer that is directly connected to you, which
| "The Network" is not.
|
| As for the analogy, uh, this ain't water. Monitor the
| line voltage or the fiber brightness or something, it'll
| tell you very quickly if the other endpoint is
| disconnected. It's up to the physical layer to provide a
| mechanism to detect disconnection, but it's not somehow
| impossible or rocket science...
| umanwizard wrote:
| Well, isn't that already how it works? If I physically
| unplug my ethernet cable, won't TCP-related syscalls
| start failing immediately?
| dataflow wrote:
| Probably, but I don't know how the physical layers work
| underneath. But regardless, it's trivial to just monitor
| _something_ constantly to ensure the connection is still
| there, you just need the hardware and protocol support.
| pjc50 wrote:
| Modern Ethernet has what it calls "fast link pulses";
| every 16ms there's some traffic to check. It's telephones
| that use voltage for hook detection.
|
| However, that only applies to the two ends of that cable,
| not between you and the datacentre on the other side of
| the world.
| remram wrote:
| > I don't know how it works... it's trivial
|
| Come on now...
|
| And it is easy to monitor, it is just an application
| concern not a L3-L4 one.
| pjc50 wrote:
| Last time I looked the behavior differed; some OSs will
| _immediately_ reset TCP connections which were using an
| interface when it goes away, others will wait until a
| packet is attempted.
| iudqnolq wrote:
| I ran into this with websockets. At least under certain
| browser/os pairs you won't ever receive a close event if
| you disconnect from wifi. I guess you need to manually
| monitor ping/pong messages and close it yourself after a
| timeout?
| Spivak wrote:
| In a packet-switched network there isn't one connection
| between you and your peer. Even if you had line
| monitoring that wouldn't be enough on its own to tell you
| that your packet can't get there -- "routing around the
| problem" isn't just a turn of phrase. On the opposite end
| networks are best-effort so even if the line is up you
| might get stuck in congestion which to you might as well
| be dropped.
|
| You can get the guarantees you want with a circuit
| switched network but there's a lot of trade-offs namely
| bandwidth and self-healing.
| pjc50 wrote:
| See https://news.ycombinator.com/item?id=40316922 :
| "pipe" is L3, the network links are L2.
| jallmann wrote:
| These types of keepalives are usually best handled at the
| application protocol layer where you can design in more knobs
| and respond in different ways. Otherwise you may see
| unexpected interactions between different keepalive
| mechanisms in different parts of the protocol stack.
| drb999 wrote:
| What you're looking for is:
| https://datatracker.ietf.org/doc/html/rfc5880
|
| BFD, it's used for millisecond failure detection and
| typically combined with BGP sessions (tcp based) to ensure
| seamless failover without packet drops.
| niutech wrote:
| Shouldn't QUIC (https://en.wikipedia.org/wiki/QUIC) solve the
| TCP issues like latency?
| djha-skin wrote:
| Quic is mostly used between client and data center, but not
| between two datacenter computers. TCP is the better choice
| once inside the datacenter.
|
| Reasons:
|
| _Security Updates_
|
| Phones run old kernels and new apps, so it makes a lot of
| sense to put something that needs to be updated often, like
| the network stack, into user space, and QUIC does well here.
|
| Data center computers run older apps on newer kernels, so it
| makes sense to put the network stack into the kernel where
| updates and operational tweaks can happen independent of the
| app release cycle.
|
| _Encryption Overhead_
|
| The overhead of TLS is not always needed inside a data
| center, whereas it is always needed on a phone.
|
| _Head of Line Blocking_
|
| Super important on a throttled or bad phone connection, not a
| big deal when all of your datacenter servers have 10G
| connections to everything else.
|
| In my opinion TCP is a battle hardened technology that just
| works even when things go bad. That it contains a setting
| with perhaps a poor default is a small thing in comparison to
| its good record for stability in most situations. It's also
| comforting to know I can tweak kernel parameters if I need
| something special for my particular use case.
| mjb wrote:
| Many performance-sensitive in-datacenter applications have
| moved away from TCP to reliable datagram protocols. Here's
| what that looks like at AWS:
| https://ieeexplore.ieee.org/document/9167399
| jallmann wrote:
| The specific issues that this article discusses (eg Nagle's
| algorithm) will be present in most packet-switched transport
| protocols, especially ones that rely on acknowledgements for
| reliability. The QUIC RFC mentions this:
| https://datatracker.ietf.org/doc/html/rfc9000#section-13
|
| Packet overhead, ack frequency, etc are the tip of the
| iceberg though. QUIC addresses some of the biggest issues
| with TCP such as head-of-line blocking but still shares the
| more finicky issues, such as different flow and congestion
| control algorithms interacting poorly.
| klabb3 wrote:
| As someone who needed high throughput and looked to QUIC
| because of control of buffers, I recommend against it at this
| time. It's got tons of performance problems depending on impl
| and the API is different.
|
| I don't think QUIC is bad, or even overengineered, really. It
| delivers useful features, in theory, that are quite well
| designed for the modern web centric world. Instead I got a
| much larger appreciation for TCP, and how well it works
| everywhere: on commodity hardware, middleboxes, autotuning,
| NIC offloading etc etc. Never underestimate battletested
| tech.
|
| In that sense, the lack of TCP_NODELAY is an exception to the
| rule that TCP performs well out of the box (golang is already
| doing this by default). As such, I think it's time to change
| the default. Not using buffers correctly is a programming
| error, imo, and can be patched.
| supriyo-biswas wrote:
| Was this ever implemented though? I found [1] but it was
| frozen due to age and was never worked on, it seems.
|
| (Edit: doing some more reading, it seems TCP_NODELAY was
| always the default in Golang. Enable TCP_NODELAY =>
| "disable Nagle's algorithm")
|
| [1] https://github.com/golang/go/issues/57530
| bboreham wrote:
| Yes. That issue is confusingly titled, but consists
| solely of a quote from the author of the code talking
| about what they were thinking at the time they did it.
| zengid wrote:
| Relevant Oxide and Friends podcast episode
| https://www.youtube.com/watch?v=mqvVmYhclAg
| matthavener wrote:
| This was a great episode and it really drove home the
| importance of visualization.
| rsc wrote:
| Not if you use a modern language that enables TCP_NODELAY by
| default, like Go. :-)
| andrewfromx wrote:
| https://news.ycombinator.com/item?id=34179426
|
| https://github.com/golang/go/issues/57530
|
| huh, TIL.
| silverwind wrote:
| Node.js also does this since at least 2020.
| Sammi wrote:
| Since 2022 v.18.
|
| PR: https://github.com/nodejs/node/pull/42163
|
| Changelog entry: https://github.com/nodejs/node/blob/main/doc
| /changelogs/CHAN...
| eru wrote:
| Why do you need a whole language for that? Couldn't you just
| use a 'modern' networking library?
| rsc wrote:
| Sure, like the one in https://9fans.github.io/plan9port/. :-)
| ironman1478 wrote:
| I've fixed latency issues caused by Nagle's multiple times in
| my career. It's the first thing I jump to. I feel like the
| logic behind it is sound, but it just doesn't work for some
| workloads. It should be something that an engineer is forced
| to set while creating a socket, instead of letting the OS
| choose a default. I think that's the main issue. Not that it's
| a good / bad option, but that there is a setting that people
| might not know about that manipulates how data is sent over
| the wire so aggressively.
| hinkley wrote:
| What you really want is for the delay to be n microseconds, but
| there's no good way to do that except putting your own user
| space buffering in front of the system calls (user space works
| better, unless you have something like io_uring amortizing
| system call times)
| bobmcnamara wrote:
| I'd rather have portable TCP_CORK
| hinkley wrote:
| Cork is probably how you'd implement this in userspace so
| why not both?
| mjevans wrote:
| It'd probably be amazing how many poorly coded games would
| work better if something like...
|
| TCP_60FPSBUFFER
|
| Would wait for ~16 ms after the first packet is queued and
| batch the data stream up.
| dishsoap wrote:
| Most games use UDP.
| Chaosvex wrote:
| Adding delay to multiplayer games? That's worse.
| jnordwick wrote:
| linux has auto-corking (and I know of no way to disable it)
| that will do these short delays on small packets even if the
| dev doesn't want it
| Bluecobra wrote:
| I agree, it has been fairly well known to disable Nagle's
| Algorithm in HFT/low latency trading circles for quite some
| time now (like > 15 years). It's one of the first things I look
| for.
| Scubabear68 wrote:
| I was setting TCP_NODELAY at Bear Stearns for custom
| networking code circa 1994 or so.
| kristjansson wrote:
| This is why I love this place
| mcoliver wrote:
| Same in M&E / vfx
| Reason077 wrote:
| Surely serious HFT systems bypass TCP altogether nowadays. In
| that world, every millisecond of latency can potentially cost
| a lot of money.
|
| These are the guys that use microwave links to connect to
| exchanges because fibre optics have too much latency.
| nsguy wrote:
| The logic is really for things like Telnet sessions. IIRC that
| was the whole motivation.
| bobmcnamara wrote:
| And for block writes!
|
| The Nagler turns a series of 4KB pages over TCP into a stream
| of MTU sized packets, rather than a short packet aligned to
| the end of each page.
| nailer wrote:
| You're right re: making delay explicit, but also crappy use the
| space networking tools don't show whether no_delay is enabled
| on sockets.
|
| Last time I had to do some Linux stuff, maybe 10 years ago, you
| had to write a systemtap program. I guess it's eBPF now. But I
| bet the userspace tools still suck.
| nailer wrote:
| > use the space
|
| Userspace. Sorry, was using voice dictation.
| Sebb767 wrote:
| > It should be something that an engineer needs to be forced to
| set while creating a socket, instead of letting the OS choose a
| default.
|
| If the intention is mostly to fix applications with bad
| `write`-behavior, this would make setting TCP_DELAY a pretty
| exotic option - you would need a software engineer to be both
| smart enough to know to set this option, but not smart enough
| to distribute their write-calls well and/or not go for writing
| their own (probably better fitted) application-specific version
| of Nagles.
| nh2 wrote:
| Same here. I have a hobby that on any RPC framework I
| encounter, I file a Github issue "did you think of TCP_NODELAY
| or can this framework do only 20 calls per second?".
|
| So far, it's found a bug every single time.
|
| Some examples: https://cloud-
| haskell.atlassian.net/browse/DP-108 or
| https://github.com/agentm/curryer/issues/3
|
| I disagree on the "not a good / bad option" though.
|
| It's a kernel-side heuristic for "magically fixing" badly
| behaved applications.
|
| As the article states, no sensible application does 1-byte
| network write() syscalls. Software that does that should be
| fixed.
|
| It makes sense only in the case when you are the kernel
| sysadmin and somehow cannot fix the software that runs on the
| machine, maybe for team-political reasons. I claim that's
| pretty rare.
|
| For all other cases, it makes sane software extra complicated:
| You need to explicitly opt-out of odd magic that makes poorly-
| written software have slightly more throughput, and that makes
| correctly-written software have huge, surprising latency.
|
| John Nagle says here and in linked threads that Delayed Acks
| are even worse. I agree. But the Send/Send/Receive receive
| pattern that Nagle's Algorithm degrades is a totally valid and
| common use case, including anything that does pipelined RPC
| over TCP.
|
| Both Delayed Acks and Nagle's Algorithm should be opt-in, in my
| opinion. It should be called TCP_DELAY, which you can opt into
| if you can't be bothered to implement basic userspace buffering.
|
| People shouldn't /need/ to know about these. Make the default
| case be the unsurprising one.
| carterschonwald wrote:
| Oh hey! It's been a while how're you?!
| a_t48 wrote:
| Thanks for the reminder to set this on the new framework I'm
| working on. :)
| jandrese wrote:
| The problem with making it opt in is that the point of the
| protocol was to fix apps that, while they perform fine for
| the developer on his LAN, would be hell on internet routers.
| So the people who benefit are the ones who don't know what
| they are doing and only use the defaults.
| klabb3 wrote:
| > As the article states, no sensible application does 1-byte
| network write() syscalls. Software that does that should be
| fixed.
|
| Yes! And worse, those that _do_ are not gonna be "fixed" by
| delays either. In this day and age with fast internets, a
| syscall per byte will bottleneck the CPU way before it'll
| saturate the network path. The CPU limit when I've been
| tuning buffers has been somewhere in the 4k-32k range for
| 10 Gbps-ish.
|
| > Both Delayed Acks and Nagle's Algorithm should be opt-in,
| in my opinion.
|
| Agreed, it causes more problems than it solves and is very
| outdated. Now, the challenge is rolling out such a change as
| smoothly as possible, which requires coordination and a lot
| of trivia knowledge of legacy systems. Migrations are never
| trivial.
| oefrha wrote:
| I doubt the libc default in established systems can change
| now, but newer languages and libraries can learn the lesson
| and do the right thing. For instance, Go sets TCP_NODELAY
| by default: https://news.ycombinator.com/item?id=34181846
| pzs wrote:
| "As the article states, no sensible application does 1-byte
| network write() syscalls." - the problem that this flag was
| meant to solve was that when a user was typing at a remote
| terminal, which used to be a pretty common use case in the
| 80's (think telnet), there was one byte available to send at
| a time over a network with a bandwidth (and latency) severely
| limited compared to today's networks. The user was happy to
| see that the typed character arrived at the other side. This
| problem is no longer significant, and the world has changed
| so that this flag has become a common issue in many current
| use cases.
|
| Was terminal software poorly written? I don't feel
| comfortable making such a judgement. It was designed for a
| constrained environment with different priorities.
|
| Anyway, I agree with the rest of your comment.
| SoftTalker wrote:
| > when a user was typing at a remote terminal, which used
| to be a pretty common use case in the 80's
|
| Still is for some. I'm probably working in a terminal on an
| ssh connection to a remote system for 80% of my work day.
| dgoldstein0 wrote:
| sure, but we do so with much better networks than in the
| 80s. The extra overhead is not going to matter when even
| a bad network nowadays is measured in megabits per second
| per user. The 80s had no such luxury.
| underdeserver wrote:
| If you're working on a distributed system, most of the
| traffic is not going to be your SSH session though.
| adgjlsfhk1 wrote:
| the difference is that at kb/s speeds, a 40x overhead on 10
| characters per second mattered. Now, humans aren't nearly
| fast enough to saturate a network.
| hgomersall wrote:
| Would one not also get clobbered by all the sys calls for
| doing many small packets? It feels like coalescing in
| userspace is a much better strategy all round if that's
| desired, but I'm not super experienced.
| imp0cat wrote:
| > I have a hobby: on any RPC framework I encounter, I file a
| GitHub issue asking "did you think of TCP_NODELAY, or can this
| framework only do 20 calls per second?"
|
| So true. Just last month we had to apply the TCP_NODELAY fix
| to one of our libraries. :)
| kazinator wrote:
| It's very easy to end up with small writes. E.g.
| 1. Write four bytes (the length of the frame)
| 2. Write the frame itself
|
| The easiest fix in C code, with the least chance of introducing
| a buffer overflow or bad performance, is to keep these two
| pieces of information in separate buffers, and use writev.
| (How portable is that compared to send?)
|
| If you have to combine the two into one flat frame, you're
| looking at allocating and copying memory.
|
| Linux has something called corking: you can "cork" a socket
| (so that it doesn't transmit), write some stuff to it
| multiple times and "uncork". It's extra syscalls though,
| yuck.
|
| You could use a buffered stream where you control flushes:
| basically another copying layer.
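|
| (A rough sketch of the writev() route, in C; the function name
| and the length-prefix framing are just for illustration:)
|         /* Send a 4-byte length prefix and the frame body with
|          * a single writev(), without copying them into one
|          * flat buffer first. Short-write handling omitted. */
|         #include <arpa/inet.h>
|         #include <stdint.h>
|         #include <sys/types.h>
|         #include <sys/uio.h>
|         static ssize_t send_frame(int fd, const void *frame,
|                                   uint32_t len)
|         {
|             uint32_t nlen = htonl(len);
|             struct iovec iov[2] = {
|                 { .iov_base = &nlen,         .iov_len = 4   },
|                 { .iov_base = (void *)frame, .iov_len = len },
|             };
|             /* One syscall; the kernel sees one contiguous
|              * chunk, so no tinygram goes out for the header
|              * alone. */
|             return writev(fd, iov, 2);
|         }
| As for portability: writev() is POSIX, so it should be at least
| as widely available as send(); on Windows the rough analogue is
| WSASend() with multiple WSABUF entries.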
| inopinatus wrote:
| With some vendors you have to solve it like a policy problem,
| via an LD_PRELOAD shim.
| ww520 wrote:
| Same here. My first job out of college was at a database
| company. Queries at the client side of the client-server based
| database were slow. It was thought the database server was slow
| as hardware back then was pretty pathetic. I traced it down to
| the network driver and found out the default setting of
| TCP_NODELAY was off. I looked like a hero when turning on that
| option and the db benchmarks jumped up.
| pjc50 wrote:
| > I feel like the logic behind it is sound, but it just doesn't
| work for some workloads.
|
| The logic is _only_ sound for interactive plaintext typing
| workloads. It should have been turned off by default 20 years
| ago, let alone now.
| p_l wrote:
| Remember that IPv4's original "target replacement date" (as it
| was only an "experimental" protocol) was 1990...
|
| And a common thing in many more complex/advanced protocols
| was to explicitly delineate "messages", which avoids the
| issue of Nagle's algorithm altogether.
| immibis wrote:
| Not when creating a socket - when sending data. When sending
| data, you should indicate whether this data block prefers high
| throughput or low latency.
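|
| (Linux already has something close to that for the throughput
| side: the MSG_MORE flag on send(), a per-call hint that more
| data is coming, so the kernel holds the partial packet back. A
| minimal sketch, assuming Linux and a made-up header/body split:)
|         #include <stddef.h>
|         #include <sys/socket.h>
|         /* Batch a header with the body that follows; the
|          * second send(), without MSG_MORE, lets the data go
|          * out. Error handling omitted. */
|         static void send_request(int fd,
|                                  const char *hdr, size_t hlen,
|                                  const char *body, size_t blen)
|         {
|             send(fd, hdr, hlen, MSG_MORE); /* more coming    */
|             send(fd, body, blen, 0);       /* push it now    */
|         }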
| kazinator wrote:
| > _that an engineer needs to be forced to set while creating a
| socket_
|
| Because there aren't enough steps in setting up sockets! Haha.
|
| I suspect that what would happen is that many of the
| programming language run-times in the world which have easier-
| to-use socket abstractions would pick a default and hide it
| from the programmer, so as not to expose an extra step.
| JoshTriplett wrote:
| I do wish that TCP_NODELAY was the default, and there was a
| TCP_DELAY option instead. That'd be a world in which people who
| _want_ the batch-style behavior (optimizing for throughput and
| fewer packets at the expense of latency) could still opt into it.
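|
| (In the world we actually have, opting out of Nagle is one
| setsockopt() call after socket()/accept(); a minimal sketch:)
|         #include <netinet/in.h>
|         #include <netinet/tcp.h>
|         #include <sys/socket.h>
|         /* Disable Nagle's algorithm on a connected TCP socket.
|          * Returns 0 on success, -1 on error (see errno). */
|         static int disable_nagle(int fd)
|         {
|             int one = 1;
|             return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
|                               &one, sizeof one);
|         }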
| mzs wrote:
| So do I, but I wish there were a new one, TCP_RTTDELAY. It would
| take a byte saying how many 128ths of an RTT to use for Nagle's
| delay instead of one RTT or a full* buffer. 0 would be the
| default, behaving as you and I prefer.
|
| * "Given the vast amount of work a modern server can do in even
| a few hundred microseconds, delaying sending data for even one
| RTT isn't clearly a win."
|
| I don't think that's such an issue anymore either: if the
| server produces so much data that it fills the output buffer
| quickly, the data is sent immediately, before the delay runs
| its course.
| stonemetal12 wrote:
| >To make a clearer case, let's turn back to the justification
| behind Nagle's algorithm: amortizing the cost of headers and
| avoiding that 40x overhead on single-byte packets. But does
| anybody send single byte packets anymore?
|
| That is a bit of a strawman there. While he uses single byte
| packets as the worst-case example, the issue as stated is any
| packet that isn't full.
| somat wrote:
| What about the opposite: disabling delayed acks?
|
| The problem is the pathological behavior when tinygram prevention
| interacts with delayed acks. There is an exposed option to turn
| off tinygram prevention (TCP_NODELAY); how would you turn off
| delayed acks instead? Say if you wanted to benchmark all four
| combinations and see what works best.
|
| doing a little research I found:
|
| Linux has the TCP_QUICKACK socket option, but you have to set it
| every time you receive. There is also
| /proc/sys/net/ipv4/tcp_delack_min and
| /proc/sys/net/ipv4/tcp_ato_min
|
| FreeBSD has net.inet.tcp.delayed_ack and net.inet.tcp.delacktime
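|
| (For the Linux case, the benchmarking loop ends up looking
| something like this sketch; TCP_QUICKACK isn't sticky, so it
| has to be re-armed around every receive:)
|         #include <netinet/in.h>
|         #include <netinet/tcp.h>
|         #include <sys/socket.h>
|         #include <sys/types.h>
|         /* recv() wrapper that re-enables quickack afterwards,
|          * keeping delayed ACKs off for the next segment.
|          * Error handling of setsockopt omitted. */
|         static ssize_t recv_quickack(int fd, void *buf, size_t n)
|         {
|             int one = 1;
|             ssize_t got = recv(fd, buf, n, 0);
|             setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK,
|                        &one, sizeof one);
|             return got;
|         }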
| Animats wrote:
| > linux has the TCP_QUICKACK socket option but you have to set
| it every time you receive
|
| Right. What were they thinking? Why would you want it off only
| some of the time?
| batmanthehorse wrote:
| In CentOS/RedHat you can add `quickack 1` to the end of a route
| to tell it to disable delayed acks for that route.
| rbjorklin wrote:
| And with systemd >= 253 you can set it as part of the network
| config to have it be applied automatically.
| https://github.com/systemd/systemd/issues/25906
| mjb wrote:
| TCP_QUICKACK does fix the worst version of the problem, but
| doesn't fix the entire problem. Nagle's algorithm will still
| wait for up to one round-trip time before sending data (at
| least as specified in the RFC), which is extra latency with
| nearly no added value.
| Culonavirus wrote:
| Apparently you have time to "do a little research" but not to
| read the entire article you're reacting to? It specifically
| mentions TCP_QUICKACK.
| benreesman wrote:
| Nagle and no delay are like 90+% of the latency bugs I've
| dealt with.
|
| Two reasonable ideas that mix terribly in practice.
| tempaskhn wrote:
| Wow, never would have thought of that.
| jedberg wrote:
| This is an interesting thing that points out why abstraction
| layers can be bad without proper message passing mechanisms.
|
| This could be fixed if there was a way for the application at L7
| to tell the TCP stack at L4 "hey, I'm an interactive shell so I
| expect to have a lot of tiny packets, you should leave
| TCP_NODELAY on for these packets" so that it can be off by
| default but on for that application to reduce overhead.
|
| Of course nowadays it's probably an unnecessary optimization
| anyway, but back in '84 it would have been super handy.
| Dylan16807 wrote:
| "I'm an interactive shell so I expect to have a lot of tiny
| packets" is what the delay is _for_. If you want to turn it off
| for those, you should turn it off for everything.
|
| (If you're worried about programs that buffer badly, then you
| could compensate with a 1ms delay. But not this round trip
| stuff.)
| eru wrote:
| The take-away I get is that abstraction layers (in the kernel)
| can be bad.
|
| Operating system kernels should enable secure multiplexing of
| resources. Abstraction and portability should be done via
| libraries.
|
| See https://en.wikipedia.org/wiki/Exokernel
| landswipe wrote:
| Use UDP ;)
| gafferongames wrote:
| Bingo
| hi-v-rocknroll wrote:
| Too many applications end up reinventing TCP or SCTP in user-
| space. Also, network-level QoS applied to unrecognized UDP
| protocols typically means it gets throttled before TCP. Use UDP
| when nothing else will work, when the use-case doesn't need a
| persistent connection, and when no other messaging or transport
| library is suitable.
| epolanski wrote:
| I was curious whether I had to change anything in my applications
| after reading that so did a bit of research.
|
| Both Node.js and curl have used TCP_NODELAY by default for a
| long time.
| Sammi wrote:
| Node.js enabled TCP_NODELAY by default in 2022, with v18.
|
| PR: https://github.com/nodejs/node/pull/42163
|
| Changelog entry:
| https://github.com/nodejs/node/blob/main/doc/changelogs/CHAN...
| epolanski wrote:
| That's the HTTP module; the 'net' module had already switched
| to TCP_NODELAY in 2015.
| gafferongames wrote:
| It's always TCP_NODELAY. Except when it's head of line blocking,
| then it's not.
| kaoD wrote:
| As a counterpoint, here's the story of how for me it wasn't
| TCP_NODELAY: for some reason my Nodejs TCP service was talking a
| few seconds to reply to my requests in localhost (Windows
| machine). After the connection was established everything was
| pretty normal but it consistently took a few seconds to establish
| the connection.
|
| I even downloaded netcat for Windows to go as bare-bones as
| possible... and the exact same thing happened.
|
| I rewrote a POC service in Rust and... oh wow, the same thing
| happens.
|
| It took me a very long time of not finding anything on the
| internet (and getting yelled at in Stack Overflow, or rather one
| of its sister sites) and painstakingly debugging (including
| writing my own tiny client with tons of debug statements) until I
| realized "localhost" was resolving first to IPv6 loopback in
| Windows and, only after quietly timing out there (because I was
| only listening on IPv4 loopback), it did try and instantly
| connect through IPv4.
| littlestymaar wrote:
| I've seen this too, but luckily someone on the internet gave
| me a pointer to the exact problem, so I didn't have to dig deep
| to figure it out.
| pandemicsyn wrote:
| I was gonna say it's always LRO offload, but my experience is
| dated.
| 0xbadcafebee wrote:
| The takeaway is odd. Clearly Nagle's Algorithm was an attempt at
| batched writes. It doesn't matter what your hardware or network
| or application or use-case or anything is; in some cases, batched
| writes are better.
|
| Lots of computing today uses batched writes. Network applications
| benefit from it too. Newer higher-level protocols like QUIC do
| batching of writes, effectively moving all of TCP's independent
| connection and error handling into userspace, so the protocol can
| move as much data into the application as fast as it can, and let
| the application (rather than a host tcp/ip stack, router, etc)
| worry about the connection and error handling of individual
| streams.
|
| Once our networks become saturated the way they were in the old
| days, Nagle's algorithm will return in the form of a QUIC
| modification, probably deeper in the application code, to wait to
| send a QUIC packet until some criteria is reached. Everything in
| technology is re-invented once either hardware or software
| reaches a bottleneck (and they always will as their capabilities
| don't grow at the same rate).
|
| (the other case besides bandwidth where Nagle's algorithm is
| useful is if you're saturating Packets Per Second (PPS) from tiny
| packets)
| Spivak wrote:
| Yes but it seems this particular implementation is using a
| heuristic for how to batch that made some assumptions that
| didn't pan out.
| p_l wrote:
| The difference between QUIC and TCP is the original sin of TCP
| (and its predecessor) - that of emulating an async serial port
| connection, with no visible messaging layer.
|
| It meant that you could use a physical teletypewriter to
| connect to services (simplified description - slap a modem on a
| serial port, dial into a TIP, write host address and port
| number, voila), but it also means that TCP has no idea of
| message boundaries, and while you can push some of that
| knowledge now the early software didn't.
|
| In comparison, QUIC and many other non-TCP protocols (SCTP,
| TP4) explicitly provide for messaging boundaries - your
| interface to the system isn't based on emulated serial ports
| but on _messages_ that might at most get reassembled.
| utensil4778 wrote:
| It's kind of incredible to think how many things in computers
| and electronics turn out to just be a serial port.
|
| One day, some future engineer is going to ask why their warp
| core diagnostic port runs at 9600 8n1.
| adgjlsfhk1 wrote:
| batching needs to be application controlled rather than
| protocol controlled. the protocol doesn't have enough context
| to batch correctly.
| ryjo wrote:
| I just ran into this this week implementing a socket library in
| CLIPS. I used Berkeley sockets, and before that I had only worked
| with higher-level languages/frameworks that abstract a lot of
| these concerns away. I was quite confused when Firefox would show
| a "connection reset by peer." It didn't occur to me it could be
| an issue "lower" in the stack. `tcpdump` helped me to observe the
| port and I saw that the server never sent anything before my
| application closed the connection.
| AtNightWeCode wrote:
| Agreed. Another thing along the same path is expect. Needs to be
| disabled in many cloud services.
| meisel wrote:
| Is this something I should also adjust on my personal Ubuntu
| machine for better network performance?
| projectileboy wrote:
| The real issue in modern data centers is TCP. Of course at
| present, we need to know about these little annoyances at the
| application layer, but what we really need is innovation in the
| data center at layer 4. And yes I know that many people are
| looking into this and have been for years, but the economic
| motivation clearly has not yet been strong enough. But that may
| change if the public's appetite for LLM-based tooling causes data
| centers to increase 10x (which seems likely).
| maple3142 wrote:
| Not sure if this is a bit off topic, but I recently encountered
| a problem where my program was calling write on a socket in a
| loop that runs N times, each iteration sending a few hundred
| bytes of data representing an application-level message. The
| loop can be understood as sending a batch of messages to the
| server. After that, the program tries to receive data from the
| server and do some processing.
|
| The problem is that if N is above a certain limit (e.g. 4), the
| server reports an error saying that the data is truncated
| somehow. I want to make N larger because the round-trip latency
| is already high enough, so being blocked by this is pretty
| annoying. Eventually, I found an answer on Stack Overflow saying
| that setting TCP_NODELAY can fix this, and it actually,
| magically, enabled me to increase N to a larger number like 64
| or 128 without causing issues. Still not sure why TCP_NODELAY
| can fix this issue and why this problem happens in the first
| place.
| blahgeek wrote:
| > The problem is that if N is above certain limit (e.g. 4), the
| server will resulting in some error saying that the data is
| truncated somehow.
|
| Maybe your server expects full application-level messages from a
| single "recv" call? That is not correct. A message may be split
| across multiple recv buffers.
| gizmo686 wrote:
| My guess would be that the server assumes that every call to
| recv() terminates on a message boundary.
|
| With TCP_NODELAY and small messages, this works out fine. Every
| message is contained in a single packet, and the userspace
| buffer being read into is large enough to contain it. As such,
| whenever the kernel has any data to give to userspace, it has
| an integer number of messages to give. Nothing requires the
| kernel to respect that, but it will not go out of its way to
| break it.
|
| In contrast, without TCP_NODELAY, messages get concatenated and
| then fragmented based on where packet boundaries occur. Now,
| the natural end point for a call to recv() is not the message
| boundary, but the packet boundary.
|
| The server is supposed to see that it is in the middle of a
| message, and make another call to recv() to get the rest of it;
| but clearly it does not do that.
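|
| (The usual fix, sketched in C and assuming a length-prefixed
| protocol purely for illustration: read exactly the number of
| bytes you need, no matter how the kernel splits them across
| recv() calls:)
|         #include <sys/socket.h>
|         #include <sys/types.h>
|         /* Read exactly `need` bytes from a blocking socket.
|          * Returns 1 on success, 0 on peer close, -1 on error.
|          * Call once for the length prefix, then once for the
|          * body of that length. */
|         static int recv_exact(int fd, void *buf, size_t need)
|         {
|             unsigned char *p = buf;
|             while (need > 0) {
|                 ssize_t n = recv(fd, p, need, 0);
|                 if (n <= 0)
|                     return (int)n;
|                 p    += n;
|                 need -= (size_t)n;
|             }
|             return 1;
|         }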
| caf wrote:
| Otherwise known as the "TCP is a stream-based abstraction,
| not a packet-based abstraction" bug.
|
| A related one is failing to process the second of two
| complete commands that happen to arrive in the same recv()
| call.
| lanstin wrote:
| I find these bugs to be a sign that the app is not using a
| good wrapper but just mostly gets lucky that the packet
| isn't split randomly on the way.
| ramblemonkey wrote:
| What if we changed the kernel or TCP stack to hold on to the
| packet for only a short time before sending it out? This could
| allow you to balance the latency against the network cost of many
| small packets. The tcp stack could even do it dynamically if
| needed.
| tucnak wrote:
| Genius
| Ono-Sendai wrote:
| From my blog > 10 years ago but sadly still relevant: "Sockets
| should have a flushHint() API call.":
| https://forwardscattering.org/post/3
| hi-v-rocknroll wrote:
| Apropos repost from 2015:
|
| > That still irks me. The real problem is not tinygram
| prevention. It's ACK delays, and that stupid fixed timer. They
| both went into TCP around the same time, but independently. I did
| tinygram prevention (the Nagle algorithm) and Berkeley did
| delayed ACKs, both in the early 1980s. The combination of the two
| is awful. Unfortunately by the time I found about delayed ACKs, I
| had changed jobs, was out of networking, and doing a product for
| Autodesk on non-networked PCs.
|
| > Delayed ACKs are a win only in certain circumstances - mostly
| character echo for Telnet. (When Berkeley installed delayed ACKs,
| they were doing a lot of Telnet from terminal concentrators in
| student terminal rooms to host VAX machines doing the work. For
| that particular situation, it made sense.) The delayed ACK timer
| is scaled to expected human response time. A delayed ACK is a bet
| that the other end will reply to what you just sent almost
| immediately. Except for some RPC protocols, this is unlikely. So
| the ACK delay mechanism loses the bet, over and over, delaying
| the ACK, waiting for a packet on which the ACK can be
| piggybacked, not getting it, and then sending the ACK, delayed.
| There's nothing in TCP to automatically turn this off. However,
| Linux (and I think Windows) now have a TCP_QUICKACK socket
| option. Turn that on unless you have a very unusual application.
|
| > Turning on TCP_NODELAY has similar effects, but can make
| throughput worse for small writes. If you write a loop which
| sends just a few bytes (worst case, one byte) to a socket with
| "write()", and the Nagle algorithm is disabled with TCP_NODELAY,
| each write becomes one IP packet. This increases traffic by a
| factor of 40, with IP and TCP headers for each payload. Tinygram
| prevention won't let you send a second packet if you have one in
| flight, unless you have enough data to fill the maximum sized
| packet. It accumulates bytes for one round trip time, then sends
| everything in the queue. That's almost always what you want. If
| you have TCP_NODELAY set, you need to be much more aware of
| buffering and flushing issues.
|
| > None of this matters for bulk one-way transfers, which is most
| HTTP today. (I've never looked at the impact of this on the SSL
| handshake, where it might matter.)
|
| > Short version: set TCP_QUICKACK. If you find a case where that
| makes things worse, let me know.
|
| > John Nagle
|
| (2015)
|
| https://news.ycombinator.com/item?id=10608356
|
| ---
|
| Support platform survey:
|
| TCP_QUICKACK: Linux (must be set again after every recv())
|
| TCP_NODELAY: Linux, Apple, Windows, Solaris, FreeBSD, OpenBSD,
| and NetBSD
|
| References:
|
| https://www.man7.org/linux/man-pages/man7/tcp.7.html
|
| https://opensource.apple.com/source/xnu/xnu-1504.9.17/bsd/ne...
|
| https://learn.microsoft.com/en-us/windows/win32/winsock/ippr...
|
| https://docs.oracle.com/cd/E88353_01/html/E37851/esc-tcp-4p....
|
| https://man.freebsd.org/cgi/man.cgi?query=tcp
|
| https://man.openbsd.org/tcp
|
| https://man.netbsd.org/NetBSD-8.0/tcp.4
| resonious wrote:
| ~15 years ago I played an MMO that was very real-time, and yet
| all of the communication was TCP. Literally you'd click a button,
| and you would not even see your action play out until a response
| packet came back.
|
| All of the kids playing this game (me included) eventually
| figured out you could turn on TCP_NODELAY to make the game
| buttery smooth - especially for those in California close to the
| game servers.
| jonathanlydall wrote:
| Not sure if you're talking about WoW, but around that time ago
| an update to the game did exactly this change (and possibly
| more).
|
| An interesting side-effect of this was that before the change
| if something stalled the TCP stream, the game would hang for a
| while then very quickly replay all the missed incoming events
| (which was very often you being killed). After the change you'd
| instead just be disconnected.
| mst wrote:
| I think I have a very vague memory of the "hang, hang, hang,
| SURPRISE! You're dead" thing happening in Diablo II but it's
| been so long I wouldn't bet on having remembered correctly.
| trollied wrote:
| Brings back memories. This was a big Sybase performance win back
| in the day.
| f1shy wrote:
| If you are sooo worried about latency, maybe TCP is a bad choice
| to start with... I hate to see people using TCP for everything,
| without a minimal understanding of which problems TCP is meant
| to solve, and especially which it doesn't.
| pcai wrote:
| TCP solves for "when i send a message i want the other side to
| actually receive it" which is...fairly common
| adgjlsfhk1 wrote:
| tcp enforces a much stricter ordering than desirable (head of
| line blocking). quic does a much better job of emulating a
| stream of independent tasks.
| kazinator wrote:
| Here is the thing. Nagle's and the delayed ACK may suck for
| individual app performance, but fewer packets on the network is
| better for the entire network.
___________________________________________________________________
(page generated 2024-05-10 23:01 UTC)