[HN Gopher] It's always TCP_NODELAY
___________________________________________________________________
It's always TCP_NODELAY
Author : todsacerdoti
Score : 333 points
Date : 2024-05-09 17:54 UTC (5 hours ago)
(HTM) web link (brooker.co.za)
(TXT) w3m dump (brooker.co.za)
| theamk wrote:
| I don't buy the reasoning for never needing Nagle anymore.
| Sure, telnet isn't a thing today, but I bet there are still
| plenty of apps which do the equivalent of:
|
|     write(fd, "Host: ");
|     write(fd, hostname);
|     write(fd, "\r\n");
|     write(fd, "Content-type: ");
|
| etc...
|
| This may not be 40x overhead, but it'd still be 5x or so.
| otterley wrote:
| Marc addresses that: "That's going to make some "write every
| byte" code slower than it would otherwise be, but those
| applications should be fixed anyway if we care about
| efficiency."
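|
| A sketch of that fix: coalesce the pieces in userspace and
| issue a single write (hypothetical names; assumes the usual
| <stdio.h>/<unistd.h> includes, error handling omitted):
|
|     char buf[1024];
|     int n = snprintf(buf, sizeof(buf),
|                      "Host: %s\r\n"
|                      "Content-type: %s\r\n",
|                      hostname, content_type);
|     write(fd, buf, n);   /* one syscall, one packet */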
| Arnt wrote:
| Those aren't the ones you debug, so they won't be seen by OP.
| Those are the ones you don't need to debug because Nagle saves
| you.
| rwmj wrote:
| The comment about telnet had me wondering what openssh does,
| and it sets TCP_NODELAY on every connection, even for
| interactive sessions. (Confirmed by both reading the code and
| observing behaviour in 'strace').
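|
| For reference, the call is just (needs <netinet/tcp.h>;
| error handling omitted):
|
|     int one = 1;
|     setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
|                &one, sizeof(one));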
| c0l0 wrote:
| _Especially_ for interactive sessions, it absolutely should!
| :)
| syncsynchalt wrote:
| Ironic since Nagle's Algorithm (which TCP_NODELAY disables)
| was invented for interactive sessions.
|
| It's hard to imagine interactive sessions making more than
| the tiniest of blips on a modern network.
| temac wrote:
| Fix the apps. Nobody expects magical perf if you do that when
| writing to files, even though the OS also has its own buffers.
| There is no reason to expect otherwise when writing to a
| socket, and Nagle already doesn't save you from syscall
| overhead anyway.
| toast0 wrote:
| Nagle doesn't save the derpy side from syscall overhead, but
| it would save the other side.
|
| It's not just apps doing this stuff, it also lives in system
| libraries. I'm still mad at the Android HTTPS library for
| sending chunked uploads as so many tinygrams. I don't
| remember exactly, but I think it's reasonable packetization
| for the data chunk (if it picked a reasonable size anyway),
| then one packet for \r\n, one for the size, and another for
| another \r\n. There's no reason for that, but it doesn't hurt
| the client enough that I can convince them to avoid the
| system library so they can fix it and the server can manage
| more throughput. Ugh. (It might be that it's just the TLS
| packetization that was this bogus and the TCP packetization
| was fine, it's been a while)
|
| If you take a pcap for some specific issue, there's always so
| many of these other terrible things in there. </rant>
| meinersbur wrote:
| Those are the apps that are quickly written and do not care if
| they unnecessarily congest the network. The ones that do get
| properly maintained can set TCP_NODELAY. Seems like a
| reasonable default to me.
| ale42 wrote:
| Apps can always misbehave, you never know what people
| implement, and you don't always have source code to patch. I
| don't think the role of the OS is to let the apps do whatever
| they wish, but it should give the possibility of doing it if
| it's needed. So I'd rather say: if you know you're doing
| things properly and you're latency-sensitive, just set
| TCP_NODELAY on all your sockets and you're fine; nobody will
| blame you for doing it.
| grishka wrote:
| And they really shouldn't do this. Even disregarding the
| network aspect of it, this is still bad for performance because
| syscalls are kinda expensive.
| jrockway wrote:
| Does this matter? Yes, there's a lot of waste. But you also
| have a 1Gbps link. Every second that you don't use the full
| 1Gbps is also waste, right?
| tedunangst wrote:
| This is why I always pad out the end of my html files with a
| megabyte of . A half empty pipe is a half wasted pipe.
| dessimus wrote:
| Just be sure HTTP Compression is off though, or you're
| still half-wasting the pipe.
|
| Better to just dump randomized incompressible data into
| html comments.
| arp242 wrote:
| I am finally starting to understand some of these
| OpenOffice/LibreOffice commit messages like
| https://github.com/LibreOffice/core/commit/a0b6744d3d77
| eatonphil wrote:
| I imagine the write calls show up pretty easily as a bottleneck
| in a flamegraph.
| wbl wrote:
| They don't. Maybe if you're really good you notice the higher
| overhead but you expect to be spending time writing to the
| network. The actual impact shows up when the bandwidth
| consumption is way up due to packet and TCP headers, which
| won't show on a flamegraph that easily.
| silisili wrote:
| We shouldn't penalize the internet at large because some
| developers write terrible code.
| littlestymaar wrote:
| Isn't that how SMTP works, though?
| leni536 wrote:
| No?
| loopdoend wrote:
| Ah yeah I fixed this exact bug in net-http in Ruby core a
| decade ago.
| tptacek wrote:
| The discussion here mostly seems to miss the point. The
| argument is to _change the default_, not to eliminate the
| behavior altogether.
| the8472 wrote:
| Shouldn't autocorking help even without Nagle?
| asveikau wrote:
| I don't think that's actually super common anymore when you
| consider that with asynchronous I/O, the only sane way to
| write is to put data into a buffer rather than blocking at
| every small write(2).
|
| Then consider that asynchronous I/O is usually necessary
| both on the server (otherwise you don't scale well) and the
| client (because blocking on network calls is a terrible
| experience, especially in today's world of frequent network
| changes, falling out of network range, etc.)
| sophacles wrote:
| TCP_CORK handles this better than nagle tho.
| mannyv wrote:
| We used to call them "packlets."
|
| His "tinygrams" is pretty good too, but that sort of implies UDP
| (D -> datagrams)
| chuckadams wrote:
| > We used to call them "packlets."
|
| setsockopt(fd, IPPROTO_TCP, TCP_MAKE_IT_GO, &go, sizeof(go));
| obelos wrote:
| Not every time. Sometimes it's DNS.
| jeffrallen wrote:
| Once every 50 years and 2 billion kilometers, it's a failing
| memory chip. But you can usually just patch around them, so no
| big deal.
| skunkworker wrote:
| Don't forget BGP or running out of disk space without an alert.
| p_l wrote:
| Once it was a failing line card in a router zeroing the last
| bit of IPv4 addresses, resulting in a ticket about "only even
| IPv4 addresses are accessible" ...
| jcgrillo wrote:
| For some reason this reminded me of the "500mi email" bug
| [1], maybe a similar level of initial apparent absurdity?
|
| [1] https://www.ibiblio.org/harris/500milemail.html
| chuckadams wrote:
| The most absurd thing to me about the 500 mile email
| situation is that sendmail just happily started up and
| soldiered on after being given a completely alien config
| file. Could be read as another example of "be liberal in
| what you accept" going awry, but sendmail's wretched config
| format is really a volume of war stories all its own...
| jcgrillo wrote:
| Configuration changes are one of those areas where having
| some kind of "are you sure? (y/n)" check can really pay
| off. It wouldn't have helped in this case, because there
| wasn't really any change management process to speak of,
| but we haven't fully learned the lesson yet.
| unconed wrote:
| Confirmations are mostly useless unless you explicitly
| spell out the implications of the change. They are also
| inferior to being able to undo changes.
|
| That's a lesson many don't know.
| marcosdumay wrote:
| When it fails, it's DNS. When it just stops moving, it's either
| TCP_NODELAY or stream buffering.
|
| Really complex systems (the Web) also fail because of caching.
| drivers99 wrote:
| Or SELinux
| rickydroll wrote:
| Not every time. Sometimes, the power cord is only connected at
| one end.
| sophacles wrote:
| One time for me it was: the glass was dirty.
|
| Some router near a construction site had dust settle into the
| gap between the laser and the fiber, and it attenuated the
| signal enough to see 40-50% packet loss.
|
| We figured out where the loss was and had our NOC email the
| relevant transit provider. A day later we got an email back
| from the tech they dispatched with the story.
| Sohcahtoa82 wrote:
| I chuckle whenever I see this meme, because in my experience,
| the issue is usually DHCP.
| batmanthehorse wrote:
| Does anyone know of a good way to enable TCP_NODELAY on sockets
| when you don't have access to the source for that application? I
| can't find any kernel settings to make it permanent, or commands
| to change it after the fact.
|
| I've been able to disable delayed acks using `quickack 1` in the
| routing table, but it seems particularly hard to enable
| TCP_NODELAY from outside the application.
|
| I've been having exactly the problem described here lately, when
| communicating between an application I own and a closed source
| application it interacts with.
| tedunangst wrote:
| LD_PRELOAD.
| batmanthehorse wrote:
| Thank you, found this: https://github.com/sschroe/libnodelay
| coldpie wrote:
| Would some kind of LD_PRELOAD interception for socket(2) work?
| Call the real function, then do setsockopt or whatever, and
| return the modified socket.
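|
| A minimal sketch of such a shim (untested; assumes the
| target makes its socket() calls through a dynamically
| linked libc):
|
|     #define _GNU_SOURCE
|     #include <dlfcn.h>
|     #include <netinet/in.h>
|     #include <netinet/tcp.h>
|     #include <sys/socket.h>
|
|     int socket(int domain, int type, int protocol)
|     {
|         static int (*real_socket)(int, int, int);
|         if (!real_socket)
|             real_socket = (int (*)(int, int, int))
|                 dlsym(RTLD_NEXT, "socket");
|         int fd = real_socket(domain, type, protocol);
|         /* Mask off SOCK_NONBLOCK/SOCK_CLOEXEC flags, then
|            set TCP_NODELAY on stream sockets. (Fails
|            harmlessly on non-TCP stream sockets.) */
|         if (fd >= 0 &&
|             (type & ~(SOCK_NONBLOCK | SOCK_CLOEXEC))
|                 == SOCK_STREAM) {
|             int one = 1;
|             setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
|                        &one, sizeof(one));
|         }
|         return fd;
|     }
|
| Build with cc -shared -fPIC -o nodelay.so nodelay.c -ldl
| and run the program under LD_PRELOAD=./nodelay.so.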
| cesarb wrote:
| > Would some kind of LD_PRELOAD interception for socket(2)
| work?
|
| That would only work if the call goes through libc, and it's
| not statically linked. However, it's becoming more and more
| common to do system calls directly, bypassing libc; the Go
| language is infamous for doing that, but there's also things
| like the rustix crate for Rust
| (https://crates.io/crates/rustix), which does direct system
| calls by default.
| zbowling wrote:
| And Go is wrong for doing that, at least on Linux. It
| bypasses optimizations in the vDSO in some cases. On
| Fuchsia, we made direct syscalls that don't go through the
| vDSO illegal, and the hacks that required in Go were funny.
| The system ABI of Linux really isn't the syscall interface,
| it's the system libc. That's because the C ABI (and the
| behaviors of the triple it was compiled for) and its isms
| for that platform are the lingua franca of that system.
| Going around that to call syscalls directly, at least for
| the 90% of useful syscalls on the system that are wrapped
| by libc, is asinine and creates odd bugs, and makes crash
| reporters, heuristic unwinders, debuggers, etc. all more
| painful to write. It also prevents the system vendor from
| implementing user-mode optimizations that avoid mode and
| context switches where possible. We tried to solve these
| issues in Fuchsia, but for Linux, Darwin, and hell, even
| Windows, if you are making direct syscalls and it's not for
| something really special and bespoke, you are just flat-out
| wrong.
| JoshTriplett wrote:
| > The system ABI of Linux really isn't the syscall
| interface, its the system libc.
|
| You might have reasons to prefer to use libc; some
| software has reason to not use libc. Those preferences
| are in conflict, but one of them is not automatically
| right and the other wrong in all circumstances.
|
| Many UNIX systems _did_ follow the premise that you
| _must_ use libc and the syscall interface is unstable.
| Linux pointedly did not, and decided to have a stable
| syscall ABI instead. This means it's possible to have
| multiple C libraries, as well as other libraries, which
| have different needs or goals and interface with the
| system differently. That's a _useful_ property of Linux.
|
| There are a couple of established mechanisms on Linux for
| intercepting syscalls: ptrace and BPF. If you want to
| intercept all uses of a syscall, intercept the syscall.
| If you want to intercept a particular glibc function _in
| programs using glibc_, or for that matter a musl
| function in a program using musl, go ahead and use
| LD_PRELOAD. But the Linux syscall interface is a valid
| and stable interface to the system, and that's why
| LD_PRELOAD is not a complete solution.
| zbowling wrote:
| It's true that Linux has a stable-ish syscall table. What
| is funny is that this caused a whole series of Samsung
| Android phones to reboot randomly with some apps: Samsung
| added a syscall at the same position someone else did in
| upstream Linux, and folks statically linking their own
| libc to avoid Bionic libc were rebooting phones when
| calling certain functions, because the Samsung syscall
| caused kernel panics when called wrong. Goes back to it
| being a bad idea to subvert your system libc. Now, distro
| vendors do give out multiple versions of a libc that all
| work with your kernel. This generally works. When we had
| to fix ABI issues this happened a few times. But I
| wouldn't trust building our own libc and assuming that
| libc is portable to any Linux machine we copy it to.
| cesarb wrote:
| > It's true that Linux has a stable-ish syscall table.
|
| It's not "stable-ish", it's fully stable. Once a syscall
| is added to the syscall table on a released version of
| the official Linux kernel, it might later be replaced by
| a "not implemented" stub (which always returns -ENOSYS),
| but it will never be reused for anything else. There's
| even reserved space on some architectures for the STREAMS
| syscalls, which were AFAIK never on any released version
| of the Linux kernel.
|
| The exception is when creating a new architecture; for
| instance, the syscall tables for 32-bit x86 and 64-bit x86
| have a completely different order.
| withinboredom wrote:
| I think what they meant (judging by the example you
| ignored) is that the table changes (even if append-only)
| and you don't know which version you actually have when
| you statically compile your own version. Thus, your
| syscalls might assume a newer version of the table, but a
| given entry may a) not actually be implemented, or b) be
| implemented with something bespoke.
| tedunangst wrote:
| The complication with the linux syscall interface is that
| it turns the worse is better up to 11. Like setuid works
| on a per thread basis, which is seriously not what you
| want, so every program/runtime must do this fun little
| thread stop and start and thunk dance.
| JoshTriplett wrote:
| Yeah, agreed. One of the items on my _long_ TODO list is
| adding `setuid_process` and `setgid_process` and similar,
| so that perhaps a decade later when new runtimes can
| count on the presence of those syscalls, they can stop
| duplicating that mechanism in userspace.
| toast0 wrote:
| > The system ABI of Linux really isn't the syscall
| interface, its the system libc.
|
| Which one? The Linux Kernel doesn't provide a libc. What
| if you're a static executable?
|
| Even on Operating Systems with a libc provided by the
| kernel, it's almost always allowed to upgrade the kernel
| without upgrading the userland (including libc); that
| works because the interface between userland and kernel
| is syscalls.
|
| That certainly ties something that makes syscalls to a
| narrow range of kernel versions, but it's not as if
| dynamically linking libc means your program will be
| compatible forever either.
| jimmaswell wrote:
| > That certainly ties something that makes syscalls to a
| narrow range of kernel versions
|
| I don't think that's right, wouldn't it be the earliest
| kernel supporting that call and onwards? The Linux ABI
| intentionally never breaks userland.
| toast0 wrote:
| In the case where you're running an Operating System that
| provides a libc and is OK with removing older syscalls,
| there's a beginning and an end to support.
|
| Looking at FreeBSD under /usr/include/sys/syscall.h,
| there's a good number of retired syscalls.
|
| On Linux under /usr/include/x86_64-linux-
| gnu/asm/unistd_32.h I see a fair number of missing
| numbers --- not sure what those are about, but 222, 223,
| 251, 285, and 387-392 are missing. (on Debian 12.1 with
| linux-image-6.1.0-12-amd64 version 6.1.52-1, if it
| matters)
| assassinator42 wrote:
| The proliferation of Docker containers seems to go
| against that. Those really only work well since the
| kernel has a stable syscall ABI. So much so that you see
| Microsoft switching to a stable syscall ABI with Windows
| 11.
| sophacles wrote:
| Linux is also weird because there are syscalls not
| supported in most (any?) libc - things like io_uring, and
| netlink fall into this.
| gpderetta wrote:
| Futex for a very long time was only accessible via
| syscall.
| Thaxll wrote:
| Those are very strong words...
| leni536 wrote:
| It should be possible to use vDSO without libc, although
| probably a lot of work.
| pie_flavor wrote:
| You seem to be saying 'it was incorrect on Fuchsia, so
| it's incorrect on Linux'. No, it's correct on Linux, and
| incorrect on every other platform, as each platform's
| documentation is very clear on. Go did it incorrectly on
| FreeBSD, but that's Go being Go; they did it in the first
| place because it's a Linux-first system and it's correct
| on Linux. And glibc does not have any special privilege,
| the vdso optimizations it takes advantage of are just as
| easily taken advantage of by the Go compiler. There's no
| reason to bucket Linux with Windows on the subject of
| syscalls when the Linux manpages are very clear that
| syscalls are there to be used and exhaustively document
| them, while MSDN is very clear that the system interface
| is kernel32.dll and ntdll.dll, and shuffles the syscall
| numbers every so often so you don't get any funny ideas.
| asveikau wrote:
| Linux doesn't even have consensus on what libc to use,
| and ABI breakage between glibc and musl is not unheard
| of. (Probably not for syscalls but for other things.)
| praptak wrote:
| Attach debugger (ptrace), call setsockopt?
| the8472 wrote:
| opening `/proc/<pid>/fd/<fd number>` and setting the socket
| option may work (not tested)
| tuetuopay wrote:
| You could try eBPF and hook the socket syscall. Might be
| harder than LD_PRELOAD as suggested by other commenters,
| though.
| jdadj wrote:
| Depending on the specifics, you might be able to add socat in
| the middle.
|
| Instead of: your_app --> server
|
| you'd have: your_app -> localhost_socat -> server
|
| socat has command line options for setting tcp_nodelay. You'd
| need to convince your closed source app to connect to
| localhost, though. But if it's doing a dns lookup, you could
| probably convince it to connect to localhost with an /etc/hosts
| entry
|
| Since your app would be talking to socat over a local socket,
| the app's tcp_nodelay wouldn't have any effect.
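|
| Something along these lines (ports and hostname are
| placeholders; socat's nodelay option sets TCP_NODELAY on
| the outbound connection):
|
|     socat TCP-LISTEN:8080,fork,reuseaddr \
|         TCP:real-server:80,nodelay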
| mirekrusin wrote:
| Can't it have an "if payload is 1 byte (or less than X) then
| wait, otherwise don't" condition?
| chuckadams wrote:
| Some network stacks like those in Solaris and HP/UX let you
| tune the "Nagle limit" in just such a fashion, up to disabling
| it entirely by setting it to 1. I'm not aware of it being
| tunable on Linux, though you can manually control the buffering
| using TCP_CORK. https://baus.net/on-tcp_cork/ has some nice
| details.
| fweimer wrote:
| There is a socket option, SO_SNDLOWAT. It's not implemented
| on Linux, according to the manual page. The descriptions in
| UNIX Network Programming and TCP/IP Illustrated conflict,
| too. So it's probably not useful.
| the8472 wrote:
| You can buffer in userspace. If you don't do small writes
| to the socket, no small packets will be sent. If you don't
| do two consecutive small writes, Nagle won't kick in.
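|
| Or hand the kernel all the pieces in one syscall with
| scatter-gather I/O (sketch; needs <sys/uio.h>, buffers are
| hypothetical):
|
|     struct iovec iov[] = {
|         { .iov_base = header, .iov_len = header_len },
|         { .iov_base = body,   .iov_len = body_len   },
|     };
|     writev(fd, iov, 2);   /* both pieces leave together */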
| astrange wrote:
| FreeBSD has accept filters, which let you do something like
| wait for a complete HTTP header (inaccurate from memory
| summary.) Not sure about the sending side.
| deathanatos wrote:
| How is what you're describing not just Nagle's algorithm?
|
| If you mean TCP_NODELAY, you should use it with TCP_CORK, which
| prevents partial frames. TCP_CORK the socket, do your writes to
| the kernel via send, and then once you have an application
| level "message" ready to send out -- i.e., once you're at the
| point where you're going to go to sleep and wait for the other
| end to respond, unset TCP_CORK & then go back to your event
| loop & sleep. The "uncork" at the end + nodelay sends the final
| partial frame, if there is one.
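|
| Roughly (Linux-specific sketch; error handling omitted):
|
|     int on = 1, off = 0;
|     setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
|     write(fd, hdr, hdr_len);     /* held in the kernel */
|     write(fd, body, body_len);   /* still held */
|     /* message complete: uncork, which flushes any final
|        partial frame immediately */
|     setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));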
| elhosots wrote:
| This sounds like the root of the vncviewer/server interaction
| bugs I experience with some VNC viewer/server combos between
| Ubuntu Linux and FreeBSD... (tight/tiger)
| evanelias wrote:
| John Nagle has posted insightful comments about the historical
| background for this many times, for example
| https://news.ycombinator.com/item?id=9048947 referenced in the
| article. He's a prolific HN commenter (#11 on the leaderboard) so
| it can be hard to find everything, but some more comments
| searchable via
| https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
| or
| https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
| Animats wrote:
| The sending pattern matters. Send/Receive/Send/Receive won't
| trigger the problem, because the request will go out
| immediately and the reply will provide an ACK and allow another
| request. Bulk transfers won't cause the problem, because if you
| fill the outgoing block size, there's no delay.
|
| But Send/Send/Receive will. This comes up a lot in game
| systems, where most of the traffic is small events going one
| way.
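|
| Sketched (hypothetical buffers):
|
|     write(fd, req1, len1);   /* sent immediately */
|     write(fd, req2, len2);   /* Nagle holds this until the
|                                 first segment is ACKed... */
|     read(fd, resp, size);    /* ...and the peer's delayed-ACK
|                                 timer holds that ACK, so both
|                                 sides stall */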
| EvanAnderson wrote:
| I love it when Nagle's algorithm comes up on HN. Inevitably
| someone, not knowing "Animats" is John Nagle, responds to a
| comment from Animats with a "knowing better" tone. >smile<
|
| (I also really like Animats' comments, too.)
| geoelectric wrote:
| I have to confess that when I saw this post, I quickly
| skimmed the threads to check if someone was trying to educate
| Animats on TCP. Think I've only seen that happen in the wild
| once or twice, but it absolutely made my day when it did.
| pclmulqdq wrote:
| In a world where bandwidth was limited, and the packet size
| minimum was 64 bytes plus an inter-frame gap (it still is for
| most Ethernet networks), sending a TCP packet for literally every
| byte wasted a huge amount of bandwidth. The same goes for sending
| empty acks.
|
| On the other hand, my general position is: it's not TCP_NODELAY,
| it's TCP.
| metadaemon wrote:
| I'd just love a protocol that has a built in mechanism for
| realizing the other side of the pipe disconnected for any
| reason.
| koverstreet wrote:
| Like TCP keepalives?
| mort96 wrote:
| If the feature already technically exists in TCP, it's
| either broken or disabled by default, which is pretty much
| the same as not having it.
| voxic11 wrote:
| Keepalives are an optional TCP feature, so they are not
| necessarily supported by all TCP implementations and
| therefore default to off even when supported.
| the8472 wrote:
| If a socket is closed properly there'll be a FIN and the
| other side can learn about it by polling the socket.
|
| If the network connection is lost due to external
| circumstances (say your modem crashes) then how would that
| information propagate from the point of failure to the remote
| end _on an idle connection_? Either you actively probe
| (keepalives) and risk false positives or you wait until you
| hear again from the other side, risking false negatives.
| sophacles wrote:
| It gets even worse - routing changes causing traffic to
| blackhole would still be undetectable without a timeout
| mechanism, since probes and responses would be lost.
| toast0 wrote:
| That's possible in circuit switched networking with various
| types of supervision, but packet switched networking has
| taken over because it's much less expensive to implement.
|
| Attempts to add connection monitoring usually make things
| worse --- if you need to reroute a cable and one or both
| ends of the cable will detect the disconnection and close
| user sockets, that's not great: instead of a quick change
| with a small period of data loss but otherwise minor
| interruption, all of the established connections will be
| dropped.
| noselasd wrote:
| SCTP has heartbeats to detect that.
| sophacles wrote:
| That's really really hard. For a full, guaranteed way to do
| this we'd need circuit switching (or circuit switching
| emulation). It's pretty expensive to do in packet networks -
| each flow would need to be tracked by each middle box, so a
| lot more RAM at every hop, and probably a lot more processing
| power. If we go with circuit establishment, it's also kind of
| expensive and breaks the whole "distributed, decentralized,
| self-healing network" property of the Internet.
|
| It's possible to do better than TCP these days, bandwidth is
| much much less constrained than it was when TCP was designed,
| but it's still a hard problem to do detection of pipe
| disconnected for _any_ reason other than timeouts (which we
| already have).
| pclmulqdq wrote:
| Several of the "reliable UDP" protocols I have worked on in
| the past have had a heartbeat mechanism that is specifically
| for detecting this. If you haven't sent a packet down the
| wire in 10-100 milliseconds, you will send an extra packet
| just to say you're still there.
|
| It's very useful to do this in intra-datacenter protocols.
| 01HNNWZ0MV43FF wrote:
| To re-word everyone else's comments - "Disconnected" is not
| well-defined in any network.
| jallmann wrote:
| These types of keepalives are usually best handled at the
| application protocol layer where you can design in more knobs
| and respond in different ways. Otherwise you may see
| unexpected interactions between different keepalive
| mechanisms in different parts of the protocol stack.
| niutech wrote:
| Shouldn't QUIC (https://en.wikipedia.org/wiki/QUIC) solve the
| TCP issues like latency?
| zengid wrote:
| Relevant Oxide and Friends podcast episode
| https://www.youtube.com/watch?v=mqvVmYhclAg
| matthavener wrote:
| This was a great episode, and it really drove home the
| importance of visualization.
| rsc wrote:
| Not if you use a modern language that enables TCP_NODELAY by
| default, like Go. :-)
| andrewfromx wrote:
| https://news.ycombinator.com/item?id=34179426
|
| https://github.com/golang/go/issues/57530
|
| huh, TIL.
| ironman1478 wrote:
| I've fixed latency issues caused by Nagle's algorithm multiple
| times in my career. It's the first thing I jump to. I feel like the
| logic behind it is sound, but it just doesn't work for some
| workloads. It should be something that an engineer needs to be
| forced to set while creating a socket, instead of letting the OS
| choose a default. I think that's the main issue. Not that it's a
| good / bad option but that there is a setting that people might
| not know about that manipulates how data is sent over the wire so
| aggressively.
| hinkley wrote:
| What you really want is for the delay to be n microseconds, but
| there's no good way to do that except putting your own user
| space buffering in front of the system calls (user space works
| better, unless you have something like io_uring amortizing
| system call times)
| Bluecobra wrote:
| I agree; disabling Nagle's algorithm has been well known in
| HFT/low-latency trading circles for quite some time now
| (like > 15 years). It's one of the first things I look for.
| Scubabear68 wrote:
| I was setting TCP_NODELAY at Bear Stearns for custom
| networking code circa 1994 or so.
| mcoliver wrote:
| Same in M&E / vfx
| nsguy wrote:
| The logic is really for things like Telnet sessions. IIRC that
| was the whole motivation.
| nailer wrote:
| You're right re: making delay explicit, but also crappy
| userspace networking tools don't show whether TCP_NODELAY is
| enabled on sockets.
|
| Last time I had to do some Linux stuff, maybe 10 years ago,
| you had to write a systemtap program. I guess it's eBPF now.
| But I bet the userspace tools still suck.
| Sebb767 wrote:
| > It should be something that an engineer needs to be forced to
| set while creating a socket, instead of letting the OS choose a
| default.
|
| If the intention is mostly to fix applications with bad
| `write`-behavior, this would make setting TCP_DELAY a pretty
| exotic option - you would need a software engineer smart
| enough to know to set this option, but not smart enough to
| distribute their write-calls well and/or to write their own
| (probably better-fitted) application-specific version of
| Nagle's.
| JoshTriplett wrote:
| I do wish that TCP_NODELAY was the default, and there was a
| TCP_DELAY option instead. That'd be a world in which people who
| _want_ the batch-style behavior (optimizing for throughput and
| fewer packets at the expense of latency) could still opt into it.
| mzs wrote:
| So do I, but I wish there was a new one, TCP_RTTDELAY. It
| would take a byte specifying how many 128ths of the RTT you
| want to use for Nagle, instead of one RTT or a full* buffer.
| 0 would be the default, behaving as you and I prefer.
|
| * "Given the vast amount of work a modern server can do in even
| a few hundred microseconds, delaying sending data for even one
| RTT isn't clearly a win."
|
| I don't think that's such an issue anymore either: if the
| server produces so much data that it fills the output buffer
| quickly, the data is immediately sent before the delay runs
| its course.
| stonemetal12 wrote:
| >To make a clearer case, let's turn back to the justification
| behind Nagle's algorithm: amortizing the cost of headers and
| avoiding that 40x overhead on single-byte packets. But does
| anybody send single byte packets anymore?
|
| That is a bit of a strawman. While he uses single-byte
| packets as the worst-case example, the issue as stated is
| any non-full packet.
| somat wrote:
| What about the opposite: disabling delayed acks?
|
| The problem is the pathological behavior when tinygram
| prevention interacts with delayed acks. There is an exposed
| option to turn off tinygram prevention (TCP_NODELAY); how
| would you turn off delayed acks instead? Say, if you wanted
| to benchmark all four combinations and see what works best.
|
| Doing a little research, I found:
|
| linux has the TCP_QUICKACK socket option but you have to set
| it every time you receive. There is also
| /proc/sys/net/ipv4/tcp_delack_min and
| /proc/sys/net/ipv4/tcp_ato_min
|
| freebsd has net.inet.tcp.delayed_ack and net.inet.tcp.delacktime
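|
| Re the Linux option: in practice you have to re-arm it
| around every receive, something like (Linux-only sketch,
| error handling omitted):
|
|     int one = 1;
|     char buf[4096];
|     for (;;) {
|         ssize_t n = recv(fd, buf, sizeof(buf), 0);
|         if (n <= 0)
|             break;
|         /* the stack clears TCP_QUICKACK internally, so
|            set it again after each read */
|         setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK,
|                    &one, sizeof(one));
|     }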
| Animats wrote:
| > linux has the TCP_QUICKACK socket option but you have to set
| it every time you receive
|
| Right. What were they thinking? Why would you want it off only
| some of the time?
| batmanthehorse wrote:
| In CentOS/RedHat you can add `quickack 1` to the end of a route
| to tell it to disable delayed acks for that route.
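|
| For example (iproute2 syntax; the prefix and device are
| placeholders):
|
|     ip route change 192.0.2.0/24 dev eth0 quickack 1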
| mjb wrote:
| TCP_QUICKACK does fix the worst version of the problem, but
| doesn't fix the entire problem. Nagle's algorithm will still
| wait for up to one round-trip time before sending data (at
| least as specified in the RFC), which is extra latency with
| nearly no added value.
| benreesman wrote:
| Nagle and no delay are like 90+% of the latency bugs I've
| dealt with.
|
| Two reasonable ideas that mix terribly in practice.
| tempaskhn wrote:
| Wow, never would have thought of that.
| jedberg wrote:
| This is an interesting thing that points out why abstraction
| layers can be bad without proper message passing mechanisms.
|
| This could be fixed if there was a way for the application at L7
| to tell the TCP stack at L4 "hey, I'm an interactive shell so I
| expect to have a lot of tiny packets, you should leave
| TCP_NODELAY on for these packets" so that it can be off by
| default but on for that application to reduce overhead.
|
| Of course nowadays it's probably an unnecessary optimization
| anyway, but back in '84 it would have been super handy.
| landswipe wrote:
| Use UDP ;)
| gafferongames wrote:
| Bingo
| epolanski wrote:
| I was curious whether I had to change anything in my
| applications after reading this, so I did a bit of research.
|
| Both Node.js and curl have used TCP_NODELAY by default for a
| long time.
| gafferongames wrote:
| It's always TCP_NODELAY. Except when it's head of line blocking,
| then it's not.
___________________________________________________________________
(page generated 2024-05-09 23:00 UTC)