[HN Gopher] When can two TCP sockets share a local address?
       ___________________________________________________________________
        
       When can two TCP sockets share a local address?
        
       Author : jgrahamc
       Score  : 148 points
       Date   : 2023-03-20 13:35 UTC (9 hours ago)
        
 (HTM) web link (blog.cloudflare.com)
 (TXT) w3m dump (blog.cloudflare.com)
        
       | nh23423fefe wrote:
       | meaningless seo quantum keyword
        
         | ilyt wrote:
         | The content is good, the way it is written is terrible
        
           | jkbs wrote:
           | Any pointers? What would you change?
        
       | pm2222 wrote:
       | Was hoping to see a summary, is there one?
       | 
       | Another case is with NAT at the internet boundary. As far as I can
       | tell, Cisco/Palo Alto firewalls only track local_ip:port, so one
       | public IP only gives you 64k connections, tops.
        
         | [deleted]
        
         | hinkley wrote:
         | The NAT box is proxying all the traffic. To the server you're
         | contacting, the NAT box is the address. Which leads to fun
         | things when you try to embed the callback address into the line
         | protocol between the machines. It doesn't work.
         | 
         | So the NAT box gets 64k local ports for the entire
         | organization, not for your one computer. That's still a lot,
         | but if you have one or two popular remotes you talk to, you
         | only get less than 2x64k combinations (say, 64k + 32k). If you
         | have 10 employees that's not a problem. If you have 50k and
         | each service makes 2 connections, you can run out quickly.
        
           | toast0 wrote:
           | It could be that the NAT box isn't so great, and just limits
           | itself to 64k outgoing connections per IP. Really depends on
           | how hard the developers wanted to work. TCP itself doesn't
           | have that limitation, but implementations of it may.
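A back-of-the-envelope sketch of the two capacity models being contrasted here: a NAT that spends one external port per flow regardless of destination, versus one that only needs the full 4-tuple to be unique. The function names and the assumption that ports 1024-65535 are usable are illustrative, not a description of any vendor's implementation.

```python
USABLE_PORTS = 65536 - 1024  # assumption: the NAT hands out ports 1024-65535


def capacity_port_only(external_ips: int) -> int:
    """NAT that reserves an external port per flow, whatever the destination."""
    return external_ips * USABLE_PORTS


def capacity_full_tuple(external_ips: int, remote_endpoints: int) -> int:
    """NAT that only requires a unique 4-tuple, so the same external port
    can be reused once for each distinct remote ip:port."""
    return external_ips * USABLE_PORTS * remote_endpoints


if __name__ == "__main__":
    print(capacity_port_only(1))      # one public IP, port-only tracking
    print(capacity_full_tuple(1, 2))  # same IP, two popular remotes
```

Under the second model the ceiling grows with the number of distinct remotes, which is the sense in which TCP itself does not impose the 64k limit.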
        
             | hinkley wrote:
             | NAT can be a feature or the whole product.
             | 
             | I would certainly hope there are boxes out there that
             | present multiple IP addresses to the 'outside' network and
             | avoid these problems. It's an old technology and there's
             | been plenty of room for improvement. I haven't caught any
             | consumer products (eg, access points, routers, firewalls)
             | doing this, and a lot of small businesses... well sometimes
             | my work network is better maintained than my home network,
             | and sometimes not so much.
        
         | majke wrote:
         | Let me try. You could expect that on a Linux host you can have
         | 64K concurrent connections to a single target IP. If you had two
         | target IPs you could expect 128k concurrent conns.
         | 
         | This is not how it works.
         | 
         | Especially when doing the bind-before-connect trick to set the
         | source IP.
         | 
         | Linux internally tracks local ports in a hashtable and often
         | forbids their reuse in surprising ways.
         | 
         | The most surprising is ordering. If you do bind-before-connect,
         | then a later connect() on that port will not work.
         | 
         | The effect is that it's very hard to achieve the 128k concurrent
         | connections in our two-target scenario.
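For readers unfamiliar with the term, a minimal sketch of bind-before-connect over loopback. The helper names `bind_before_connect` and `demo` are made up for this example; the point is only that the local port gets chosen at bind() time, before the kernel knows the destination.

```python
import socket


def bind_before_connect(source_ip: str, remote: tuple[str, int]) -> socket.socket:
    """Choose the source IP with bind() (port 0 lets the kernel pick a port),
    then connect(). The local port is reserved at bind() time, before the
    destination is known."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((source_ip, 0))
    s.connect(remote)
    return s


def demo() -> tuple[str, int]:
    # Throwaway loopback listener so the example is self-contained.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = bind_before_connect("127.0.0.1", listener.getsockname())
    local = client.getsockname()
    client.close()
    listener.close()
    return local


if __name__ == "__main__":
    print(demo())  # (source ip, kernel-chosen local port)
```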
        
           | toast0 wrote:
           | I've done similar exercises with FreeBSD. The tldr is, if you
           | want to approach the connection limits[1], you need to use
           | bind-before-connect for the bulk connections, and it's useful
           | to have connect on a separate (small) port range, for the
           | ancillary connections your machine might have. There's lots
           | of other things you might want to do, such as divide your
           | outgoing ports by thread/cpu so there's no conflicts; and if
           | you're dividing by cpu, you should probably calculate the
           | receive side scaling (RSS) hash, most likely Toeplitz, of the
           | returning packets, so you work the socket from userspace on
           | the same CPU it's going to be worked in kernelspace when it
           | comes in.
           | 
           | Modern FreeBSD and presumably Linux are very good at scaling
           | inbound connections, but if you need to scale outbound
           | connections, you've got to do more of the work.
           | 
           | [1] Also, consider if you're in the client seat, you're more
           | likely to be the one closing the connections, so you've got
           | TIME_WAIT states on your side; depending on your rate of
           | connection closing, you may have a significant number of
           | those, clogging up your ports; or it may not be significant
           | at all.
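A small sketch of the "divide your outgoing ports by thread/cpu" idea above: split a port range into non-overlapping slices so each worker only ever binds ports from its own slice. The function name and the example range are hypothetical, and the RSS/Toeplitz matching mentioned in the comment is not attempted here.

```python
def partition_ports(low: int, high: int, workers: int) -> list[range]:
    """Split the inclusive port range [low, high] into contiguous,
    non-overlapping slices, one per worker."""
    if workers < 1 or high < low:
        raise ValueError("need at least one worker and a non-empty range")
    total = high - low + 1
    base, extra = divmod(total, workers)
    slices, start = [], low
    for i in range(workers):
        # The first `extra` workers get one additional port each.
        size = base + (1 if i < extra else 0)
        slices.append(range(start, start + size))
        start += size
    return slices


if __name__ == "__main__":
    for worker, ports in enumerate(partition_ports(10000, 10009, 3)):
        print(worker, list(ports))
```

Each worker would then pass a port from its slice to bind() before calling connect(), so two workers never contend for the same local port.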
        
       | wyldfire wrote:
       | The identity of a TCP/IP connection is a quintuple containing {
       | src addr, src port, dest addr, dest port, proto }. It feels like
       | the article would be better if before or after all the network
       | stack spelunking and quizzing that it was mentioned somewhere. We
       | don't _need_ to rely on the implementation, though there are
       | always interesting corner cases worth exploring.
        
         | twic wrote:
         | What is "proto" in your tuple? If we're talking about TCP, then
         | it's always TCP, isn't it?
         | 
         | I think that formulation was invented to explain the fact that
         | TCP and UDP can have distinct otherwise identical sockets, but
         | i don't think it's a great way to do that. TCP and UDP just
         | have completely separate spaces of socket addresses, the same
         | way TCP and NetBIOS or TCP and UNIX domain sockets do.
        
           | toast0 wrote:
           | Well, when one says TCP/IP, they usually mean to include UDP
           | and ICMP. Although ICMP doesn't have ports, so managing state
           | is _different_.
           | 
           | UDP and TCP both use 4-tuples with the same information, so
           | even though I think it's more common to have a separate table
           | for UDP and TCP, you can conceptually consider it a 5-tuple.
           | It's all a conceptual model, but I'd put protocol up front,
           | {tcp, RemoteIP, LocalIP, RemotePort, LocalPort}, {udp,
           | RemoteIP, LocalIP, RemotePort, LocalPort}, {unix, Path},
           | {netbios, IDontRememberHowItsAddressed}, {icmp,
           | SomethingConfusing}, etc. If you can't handle multiple arity
           | tuples, you could make a nested 2-tuple for tcp and udp, like
           | {tcp, {RemoteIP ... }}. It's all just conceptual notation
           | though, so there's tons of ways to do it (you'll see I differ
           | in both names and ordering compared to the other commenters,
           | but that's not actually significant either)
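The protocol-first notation above maps naturally onto a dictionary keyed by tuples. This is only a conceptual sketch: `FlowTable` and the owner strings are invented for illustration, and, as the comment notes, real stacks more commonly keep separate tables per protocol.

```python
from typing import Hashable, Optional

FlowKey = tuple[Hashable, ...]


class FlowTable:
    """One dict for all protocols. The protocol tag leads each key, so keys
    of different arity (tcp/udp 5-tuples, unix 2-tuples) never collide."""

    def __init__(self) -> None:
        self._flows: dict[FlowKey, str] = {}

    def add(self, key: FlowKey, owner: str) -> bool:
        """Register a flow; return False if the exact key is already taken."""
        if key in self._flows:
            return False
        self._flows[key] = owner
        return True

    def lookup(self, key: FlowKey) -> Optional[str]:
        return self._flows.get(key)


if __name__ == "__main__":
    table = FlowTable()
    # Key layout: (proto, RemoteIP, LocalIP, RemotePort, LocalPort)
    print(table.add(("tcp", "192.0.2.1", "198.51.100.1", 80, 18490), "sock-a"))
    print(table.add(("udp", "192.0.2.1", "198.51.100.1", 80, 18490), "sock-b"))
    print(table.add(("tcp", "192.0.2.1", "198.51.100.1", 80, 18490), "sock-c"))
```

The third add() is rejected because that exact TCP key already exists, while the UDP key with the same addresses and ports is accepted.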
        
             | MuffinFlavored wrote:
             | > {tcp, RemoteIP, LocalIP, RemotePort, LocalPort}
             | 
             | Does the concept of Remote/Local IP have to do/get
             | introduced when you discuss NAT?
        
               | toast0 wrote:
               | I would use Remote and Local for host networking first of
               | all; rather than src/dest, because when you send you're
               | the src, and when you receive, you're the dst... you
               | don't want to include both permutations in the table
               | (unless you're both the source and the destination, ie:
               | connecting to yourself).
               | 
               | For NAT, you need to have a way to calculate the 5-tuple
               | for SideA when you have a 5-tuple from SideB, and vice
               | versa; most often, that'll be a table lookup, either for
               | the whole 5-tuple, or for 1:1 NAT, it could just be a
               | lookup for the "Local" IP. In that case, maybe src and
               | dest make more sense, and the NAT isn't really Local in
               | my book.
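The "calculate the 5-tuple for SideA from SideB, and vice versa" step can be sketched as a pair of table lookups. `NatTable`, its naive port allocator, and the addresses are all assumptions for illustration; a real NAT also has to handle timeouts, port reuse, and exhaustion.

```python
from typing import Optional

# (proto, src_ip, src_port, dst_ip, dst_port)
FiveTuple = tuple[str, str, int, str, int]


class NatTable:
    """Source-NAT sketch: maps an inside 5-tuple to its outside form and back."""

    def __init__(self, public_ip: str, first_port: int = 1024) -> None:
        self.public_ip = public_ip
        self._next_port = first_port
        self._out: dict[FiveTuple, FiveTuple] = {}   # inside -> outside
        self._back: dict[FiveTuple, FiveTuple] = {}  # outside reply -> inside reply

    def outbound(self, pkt: FiveTuple) -> FiveTuple:
        if pkt not in self._out:
            proto, src_ip, src_port, dst_ip, dst_port = pkt
            ext_port = self._next_port
            self._next_port += 1  # naive allocator: never reuses or wraps
            self._out[pkt] = (proto, self.public_ip, ext_port, dst_ip, dst_port)
            # A reply arrives with source and destination swapped.
            self._back[(proto, dst_ip, dst_port, self.public_ip, ext_port)] = (
                proto, dst_ip, dst_port, src_ip, src_port)
        return self._out[pkt]

    def inbound(self, pkt: FiveTuple) -> Optional[FiveTuple]:
        return self._back.get(pkt)  # None means no mapping, so drop


if __name__ == "__main__":
    nat = NatTable("203.0.113.1")
    print(nat.outbound(("tcp", "10.0.0.5", 40000, "192.0.2.1", 80)))
    print(nat.inbound(("tcp", "192.0.2.1", 80, "203.0.113.1", 1024)))
```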
        
           | wyldfire wrote:
           | Indeed - I fudged things a bit by talking about TCP there. It
           | would have been clearer if I just discussed IP instead.
           | 
           | > TCP and UDP just have completely separate spaces of socket
           | addresses
           | 
           | But so does SCTP, and ICMP and IGMP and ... -- so rather than
           | enumerate the protocols we can just describe this property of
           | IP.
        
             | twic wrote:
             | But the structure of those spaces can be different! The
             | only structure IP imposes is that every packet has a source
             | and destination address. It's up to each protocol whether
             | it has port numbers (like TCP, UDP, and SCTP), or not (like
             | ICMP and IGMP), or some other mechanism for identifying
             | flows.
        
             | derefr wrote:
             | Yeah, but for ICMP and so forth, ports aren't a thing. So
             | you don't really have a universal 5-tuple.
             | 
             | For a router or other middlebox, or an OS kernel, to do
             | things like outbound-initiated-flow firewall-rule
             | exceptions correctly, it must keep N different flow-state
             | tables, one per transport-layer (L4) protocol; where each
             | flow-state table's "primary key" is over a set of columns
             | unique to that table / L4 protocol.
             | 
             | TCP and UDP just happen to be both the best-known L4
             | protocols, and to both use {srcIP, srcPort, dstIP, dstPort}
             | as their "primary key" for flows; but this doesn't hold for
             | other L4 protocols.
             | 
             | (Which is in turn why L4 protocols "must" be handled in
             | kernel-land, for kernel firewalls, traffic-shapers, etc. to
             | work: L4 flow-state doesn't have a universal schema for
             | these services to work with; and because these services are
             | implemented in static-compiled languages, they have to be
             | built with compile-time knowledge of each known L4
             | protocol, so that they can have concrete implementations
             | for each L4 protocol written or generated for each service.
             | There's no way to just bring in (through some hypothetical
             | FUSE-like "userland L4 protocol server" abstraction) more
             | L4 protocols, and expect those kernel facilities to work
             | with them. [And all the same goes for ASICs in L4 network
             | routers -- only moreso.] Which is why we got the L4
             | protocol ossification we did. Modern protocols like SCTP
             | and QUIC being implemented on top of UDP, is a direct
             | result of there being no universal 5-tuple!)
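To make the "each L4 protocol has its own primary key" point concrete, here is a rough sketch that derives a flow key from raw IPv4 bytes. The function names are invented; using the ICMP echo identifier as the key is an assumption for this example (similar in spirit to how NATs commonly treat ICMP queries), and fragments and IP options beyond the header length are ignored.

```python
import socket
import struct


def flow_key(packet: bytes) -> tuple:
    """Derive a flow-table key from a raw IPv4 packet. Each L4 protocol
    needs its own rule; unknown protocols fall back to addresses only."""
    ihl = (packet[0] & 0x0F) * 4  # IPv4 header length in bytes
    proto = packet[9]
    src = socket.inet_ntoa(packet[12:16])
    dst = socket.inet_ntoa(packet[16:20])
    l4 = packet[ihl:]
    if proto in (socket.IPPROTO_TCP, socket.IPPROTO_UDP):
        sport, dport = struct.unpack("!HH", l4[:4])
        return (proto, src, sport, dst, dport)
    if proto == socket.IPPROTO_ICMP:
        icmp_type, _code, _csum, ident = struct.unpack("!BBHH", l4[:6])
        if icmp_type in (0, 8):  # echo reply / echo request
            return (proto, src, dst, ident)
    return (proto, src, dst)


def ipv4(proto: int, src: str, dst: str, payload: bytes) -> bytes:
    """Build a minimal 20-byte IPv4 header (checksum left at 0) plus payload."""
    header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(payload), 0, 0, 64,
                         proto, 0, socket.inet_aton(src), socket.inet_aton(dst))
    return header + payload


if __name__ == "__main__":
    tcp = ipv4(6, "198.51.100.1", "192.0.2.1", struct.pack("!HH", 18490, 80))
    ping = ipv4(1, "198.51.100.1", "192.0.2.1", struct.pack("!BBHHH", 8, 0, 0, 0x1234, 1))
    print(flow_key(tcp))
    print(flow_key(ping))
```

The keys come out with different shapes per protocol, which is why code like this has to know every protocol it handles ahead of time.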
        
               | richardwhiuk wrote:
               | You can handle L4 protocols in userspace. You can bind to
               | a particular IP protocol number. You can even handle L3
               | in userspace.
               | 
               | Obviously if you do this you lose the ability for
               | multiple applications to handle different "ports", unless
               | you do the multiplexing in userspace as well.
        
               | derefr wrote:
               | You have a unique definition of "handle" that doesn't
               | seem to include "your OS's kernel packet filter keeps
               | working to pre-filter these packets based on an L4
               | understanding of them before handing them to userspace,
               | or after being handed them _by_ userspace."
               | 
               | Which, if your machine is acting as something like a
               | router/NAT/firewall, is kind of... the entire point of
               | the box being there in the communication path.
        
         | loeg wrote:
         | I agree that is super important and maybe worth mentioning but
         | the point of the quiz is to demonstrate that Linux's
         | implementation is actually more constrained than the
         | traditional "unique src addr/port dst addr/port 4-tuple" (for
         | TCP).
        
           | DSMan195276 wrote:
           | Yeah I feel like people are missing the point that it's not a
           | 4-tuple thing, it's an ordering issue. Since the source port
           | (sometimes) gets picked before the interface or destination
           | is, you can get an EADDRNOTAVAIL result even when there's
           | technically a potential for it to end up as a unique 4-tuple.
           | Doing the assignment in a different order or more explicitly
           | can allow it to work by making sure that the kernel always
           | knows it will be unique.
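One Linux-specific way to change the order is the IP_BIND_ADDRESS_NO_PORT socket option (Linux 4.2 and later), which lets bind() fix the source IP while leaving the port choice to connect(). The thread does not name this option, so treat the sketch below as one possible illustration. The fallback value 24 comes from the Linux headers, and the function simply returns None where the option is unavailable.

```python
import socket
import sys

# Linux value from <linux/in.h>; some Python builds lack the named constant.
IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)


def late_port_selection():
    """Bind the source IP without reserving a port, then connect().
    Returns (port_after_bind, port_after_connect), or None if unsupported."""
    if not sys.platform.startswith("linux"):
        return None
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        client.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
        client.bind(("127.0.0.1", 0))
        after_bind = client.getsockname()[1]
        # The kernel picks the port here, when the destination is known.
        client.connect(listener.getsockname())
        return after_bind, client.getsockname()[1]
    except OSError:
        return None  # older kernel, or a sandbox without the option
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(late_port_selection())
```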
        
         | [deleted]
        
       | chatmasta wrote:
       | I'm surprised the blog post does not mention Cloudflare's own
       | library, tubular [0], the "BSD socket API on steroids":
       | 
       | > The control plane for BPF socket lookup. Steers traffic that
       | arrives via the tubes of the Internet to processes running on the
       | machine. Its much more flexible than traditional BSD bind
       | semantics:
       | 
       | > * You can bind to all ports on an IP
       | 
       | > * You can bind to a subnet instead of an IP
       | 
       | > * You can bind to all ports on a subnet
       | 
       | I played with it once and found it to be pretty awesome.
       | 
       | [0] https://github.com/cloudflare/tubular
        
         | majke wrote:
         | Thanks for reminder. This was a blog post after we hit the
         | connectx() troubles :)
         | 
         | Tubular/sk_lookup is about ingress.
         | 
         | This blog post is about connected sockets on "egress".
         | 
         | Actually, bpf sk lookup could be used on egress, but it's not
         | implemented quite yet.
        
       | loeg wrote:
       | Nice drgn shout-out 2/3 of the way down the page!
       | 
       | https://drgn.readthedocs.io/en/latest/index.html
        
       | waynesonfire wrote:
       | Can someone TLDR this? Interesting question, but not worth
       | reading all this to find out.
       | 
       | Can two IPs bind to the same IP / port on the server? Even if
       | they come from the same client, my intuition says no.
        
         | toast0 wrote:
         | I'm having a hard time understanding your question. But let me
         | try. If none of these answer your question, you're going to
         | need to provide some more concrete details.
         | 
         | If your server has IPs 192.0.2.1 and 192.0.2.2, you can bind to
         | 192.0.2.1:80 on one socket and 192.0.2.2:80 on another and
         | listen for connections on both (or use it for outgoing
         | connections, whatever). Or you could bind to 0.0.0.0:80 and
         | listen for connections on either, as long as they don't have
         | more specific bindings.
         | 
         | If your client has IP 198.51.100.1, it can connect from tcp
         | 198.51.100.1:18490 to 192.0.2.1:80 in one socket, and from tcp
         | 198.51.100.1:18490 to 192.0.2.2:80 in another. You could also
         | do the same on UDP, UDP's ports are orthogonal to TCP's.
         | 
         | If you have another client with IP 198.51.100.2, it can also
         | use local port 18490 to connect to both servers on port 80.
         | 
         | A server can accept a virtually unlimited number of connections
         | on a given tcp port, but a maximum of 64k per client IP, and
         | your operating system will have some limit on the number of
         | sockets; you can often raise that to some function of physical
         | memory (FreeBSD limits you to a maximum of one fd for each four
         | pages of physical memory); although, it takes a very specific
         | load to be able to max out the FDs. Something like HAProxy in
         | tcp forwarding mode can do it, but most other applications are
         | going to use enough CPU and memory that you'll run out of those
         | before you run out of FDs.
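A loopback sketch of the "same local port, two different destinations" case described above. Because a single test machine rarely has two server IPs, the two remotes here differ by port rather than by address, which still yields two distinct 4-tuples. Setting SO_REUSEADDR on both client sockets is an assumption about what the OS needs to allow the second bind(); where it is refused, the function returns None.

```python
import socket


def _listener() -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    s.listen(1)
    return s


def share_local_port():
    """Connect two sockets from one local ip:port to two different remotes.
    Returns both local ports, or None if this OS refuses the second
    bind() or connect()."""
    a, b = _listener(), _listener()
    c1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        for c in (c1, c2):
            c.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        c1.bind(("127.0.0.1", 0))
        c1.connect(a.getsockname())
        c2.bind(c1.getsockname())    # same local ip:port as c1
        c2.connect(b.getsockname())  # different remote, so a unique 4-tuple
        return c1.getsockname()[1], c2.getsockname()[1]
    except OSError:
        return None
    finally:
        for s in (c1, c2, a, b):
            s.close()


if __name__ == "__main__":
    print(share_local_port())
```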
        
       | jmholla wrote:
       | I didn't really like the format of this article. The upfront quiz
       | without much context or insight makes it feel like you're in
       | class for an exam you didn't study for. The answers they provide
       | are largely just Yes or No. When you get to the meat of the
       | article, the quiz itself is forgotten until the conclusion. In
       | fact, they even mention how useless the quiz is to provide
       | insight at the beginning of the instructive content.
       | 
       | > Is it all clear now? Well, probably no. It feels like reverse
       | engineering a black box. So what is happening behind the scenes?
       | Let's take a look.
       | 
       | Keeping the quiz format, I think it would've been better to have
       | it at the end then have the answers tie into the concepts you've
       | learned. Or have the questions littered throughout the article
       | and used to build upon the concepts it is teaching you. In its
       | current state, I don't see this article's information having much
       | staying power in my brain without concentrated study of the
       | article.
        
         | dan-robertson wrote:
         | I enjoyed the quiz personally. I was kinda expecting more
         | complicated questions, eg nat, iptables rules selecting local
         | ip / interface, weird networking edge cases and so on.
        
         | [deleted]
        
         | AstixAndBelix wrote:
         | This article is meant for tech people, and the purpose of the
         | quiz is to make you realize how little you know about the
         | subject even as an expert. This is not a pop-sci YouTube video.
        
           | ilyt wrote:
           | And it fails at that. All the format does is annoy, and having
           | to go through links to random GitHub gists is extra annoying
           | if you didn't know the answer and just want to learn.
        
         | dcow wrote:
         | Same. I don't want to go run python code to play your little
         | quiz game, sorry.
         | 
         | I think the article would be better structured as an
         | explanation of what a connection _is_, its 5-tuple, and then a
         | dive into any surprising exceptions to the rule that there can
         | be no ambiguity in where a packet is delivered, it must go to
         | exactly one spot on a host. So as long as you can find an
         | unambiguous subset of available 5-tuples, you're allowed to
         | bind/connect.
        
           | loeg wrote:
           | The Python quiz code is easily read as identical to the
           | equivalent syscalls. You can click through to the answers and
           | read them out of the docstrings without ever running a Python
           | interpreter (that's what I did).
        
             | dcow wrote:
             | It was 7am and I wasn't in the mood to hop back and forth
             | between GH and the blog post trying a bunch of examples and
             | piecing things together. I just wanted to read an
             | interesting writeup. That's all.
        
           | fierro wrote:
           | agreed, the post is missing a clear explanation of what
           | defines a TCP connection
        
             | jkbs wrote:
             | Fair point, though gauging what is the knowledge-level of
             | your audience is hard. Gotta pick what to include and what
             | to omit if you want to keep it concise.
        
               | fierro wrote:
               | for sure, gauging your audience is difficult. Still, any
               | piece of writing benefits from a good framing. Great
               | post, but I think jumping right into the details loses X%
               | of potential readers. Keep up the good work :)
        
       | jeffbee wrote:
       | One of the best things about user-space TCP implementations is
       | never needing to be an archaeologist, never trying to figure out
       | the path-dependent sequence of accidents that leads to surprising
       | and undocumented kernel behaviors.
        
         | forgot_old_user wrote:
         | But wouldn't the same issue apply to user-space TCP
         | implementations too? User-space TCP implementations too could
         | have "path-dependent sequence of accidents" which a power user
         | might eventually need to figure out?
        
           | jeffbee wrote:
           | Yes but instead of being a 35-year-old accretion of mistakes,
           | a user-space network stack is likely to be part of a more
           | typical software lifecycle, that gets updated more easily and
           | ultimately replaced. Also such things are dramatically easier
           | to debug.
        
             | loeg wrote:
             | Happily Linux is at most a 32-year old accretion of
             | mistakes (and I'm not completely confident 0.01 even had
             | TCP).
        
         | FPGAhacker wrote:
         | What are arguments against user-space TCP?
        
           | ilyt wrote:
           | Normally, if an app opens a port that the firewall or
           | application permissions don't allow, it just gets an error;
           | with RAW sockets the kernel would need to parse each packet
           | before deciding that.
           | 
           | For example, you normally get permission denied when you try
           | to listen on a sub-1024 port as a normal user.
           | 
           | I'd also imagine that if the kernel is doing any kind of
           | connection tracking (so really anything with a firewall), it
           | would be more optimal to _have_ that connection tracked in the
           | kernel vs decoding it and adding it to the conntrack table.
           | 
           | I guess some kind of half-RAW could be done in place, like,
           | say, a socket where you define the protocol and port but
           | handle the actual packets in userspace?
        
           | jeffbee wrote:
           | It doesn't work out-of-the-box with any existing software, so
           | it only applies to things you are building from scratch, or
           | that are structured in such a way that adding an alternate
           | network mode is practical, or that would work with a dynamic
           | library that shims the entire sockets API. Your user-mode
           | stack won't be observable by any existing monitoring tools.
           | Also, your process needs to execute with CAP_NET_RAW.
        
           | chatmasta wrote:
           | One argument is that it requires building on top of a raw
           | socket, which can open you to all sorts of ancient
           | vulnerabilities that have been patched in the battle-tested
           | code running in the kernel, e.g. this recent ICMP remote code
           | execution vulnerability [0] ("An attacker could send a low-
           | level protocol error containing a fragmented IP packet inside
           | another ICMP packet in its header to the target machine. To
           | trigger the vulnerable code path, an application on the
           | target must be bound to a raw socket") [1].
           | 
           | [0] Discussion: https://old.reddit.com/r/netsec/comments/11s8
           | 0zo/cve20232341...
           | 
           | [1] Advisory: https://msrc.microsoft.com/update-
           | guide/vulnerability/CVE-20...
        
       ___________________________________________________________________
       (page generated 2023-03-20 23:01 UTC)