[HN Gopher] When can two TCP sockets share a local address?
___________________________________________________________________
When can two TCP sockets share a local address?
Author : jgrahamc
Score : 148 points
Date : 2023-03-20 13:35 UTC (9 hours ago)
(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)
| nh23423fefe wrote:
| meaningless seo quantum keyword
| ilyt wrote:
| The content is good, the way it is written is terrible
| jkbs wrote:
| Any pointers? What would you change?
| pm2222 wrote:
| Was hoping to see a summary, is there one?
|
| Another case is NAT at the internet boundary. As far as I can
| tell, Cisco/Palo Alto firewalls only track local_ip:port, so one
| public IP only gives you 64k connections, tops.
| [deleted]
| hinkley wrote:
| The NAT box is proxying all the traffic. To the server you're
| contacting, the NAT box is the address. Which leads to fun
| things when you try to embed the callback address into the line
| protocol between the machines. It doesn't work.
|
| So the NAT box gets 64k local ports for the entire
| organization, not for your one computer. That's still a lot,
| but if you have one or two popular remotes you talk to, you
| only get less than 2x64k combinations (say, 64k + 32k). If you
| have 10 employees that's not a problem. If you have 50k and
| each service makes 2 connections, you can run out quickly.
| toast0 wrote:
| It could be that the NAT box isn't so great, and just limits
| itself to 64k outgoing connections per IP. Really depends on
| how hard the developers wanted to work. TCP itself doesn't
| have that limitation, but implementations of it may.
| hinkley wrote:
| NAT can be a feature or the whole product.
|
| I would certainly hope there are boxes out there that
| present multiple IP addresses to the 'outside' network and
| avoid these problems. It's an old technology and there's
| been plenty of room for improvement. I haven't caught any
| consumer products (eg, access points, routers, firewalls)
| doing this, and a lot of small businesses... well sometimes
| my work network is better maintained than my home network,
| and sometimes not so much.
| majke wrote:
| Let me try. You might expect that on a Linux host you can have
| 64K concurrent connections to a single target IP. If you had two
| target IPs, you might expect 128K concurrent connections.
|
| This is not how it works.
|
| Especially when using the bind-before-connect trick to set the
| source IP.
|
| Linux internally tracks local ports in a hashtable and often
| forbids their reuse in surprising ways.
|
| The most surprising is ordering. If one socket does
| bind-before-connect on a port, a later plain connect() will not
| be able to use that port.
|
| The effect is that it's very hard to achieve the 128K
| concurrent connections in our two-target scenario.
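For readers unfamiliar with the term, the bind-before-connect pattern majke mentions can be sketched roughly as follows. This is a minimal illustration over loopback, not the article's code; the helper names `bind_before_connect` and `demo` are made up for this example.

```python
import socket


def bind_before_connect(source_ip, dest):
    """Open a TCP connection to dest with the source IP pinned via bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 asks the kernel to choose the local port at bind() time,
    # before it knows which destination the socket will connect to.
    sock.bind((source_ip, 0))
    sock.connect(dest)
    return sock


def demo():
    """Connect to a throwaway loopback listener and report both endpoints."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    target = listener.getsockname()

    client = bind_before_connect("127.0.0.1", target)
    local, peer = client.getsockname(), client.getpeername()

    client.close()
    listener.close()
    return local, peer, target
```

Because the local port is fixed before the destination is known, the kernel has to be conservative about handing that port to anyone else, which is the ordering effect described in the comment.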
| toast0 wrote:
| I've done similar exercises with FreeBSD. The tldr is, if you
| want to approach the connection limits[1], you need to use
| bind-before-connect for the bulk connections, and it's useful
| to have connect on a separate (small) port range, for the
| ancillary connections your machine might have. There's lots
| of other things you might want to do, such as divide your
| outgoing ports by thread/cpu so there's no conflicts; and if
| you're dividing by cpu, you should probably calculate the
| receive side scaling (RSS) hash, most likely Toeplitz, of the
| returning packets, so you work the socket from userspace on
| the same CPU it's going to be worked in kernelspace when it
| comes in.
|
| Modern FreeBSD and presumably Linux are very good at scaling
| inbound connections, but if you need to scale outbound
| connections, you've got to do more of the work.
|
| [1] Also, consider if you're in the client seat, you're more
| likely to be the one closing the connections, so you've got
| TIME_WAIT states on your side; depending on your rate of
| connection closing, you may have a significant number of
| those, clogging up your ports; or it may not be significant
| at all.
| wyldfire wrote:
| The identity of a TCP/IP connection is a quintuple containing {
| src addr, src port, dest addr, dest port, proto }. It feels like
| the article would be better if before or after all the network
| stack spelunking and quizzing that it was mentioned somewhere. We
| don't _need_ to rely on the implementation, though there are
| always interesting corner cases worth exploring.
| twic wrote:
| What is "proto" in your tuple? If we're talking about TCP, then
| it's always TCP, isn't it?
|
| I think that formulation was invented to explain the fact that
| TCP and UDP can have distinct otherwise identical sockets, but
| i don't think it's a great way to do that. TCP and UDP just
| have completely separate spaces of socket addresses, the same
| way TCP and NetBIOS or TCP and UNIX domain sockets do.
| toast0 wrote:
| Well, when one says TCP/IP, they usually mean to include UDP
| and ICMP. Although ICMP doesn't have ports, so managing state
| is _different_.
|
| UDP and TCP both use 4-tuples with the same information, so
| even though I think it's more common to have a separate table
| for UDP and TCP, you can conceptually consider it a 5-tuple.
| It's all a conceptual model, but I'd put protocol up front,
| {tcp, RemoteIP, LocalIP, RemotePort, LocalPort}, {udp,
| RemoteIP, LocalIP, RemotePort, LocalPort}, {unix, Path},
| {netbios, IDontRememberHowItsAddressed}, {icmp,
| SomethingConfusing}, etc. If you can't handle multiple arity
| tuples, you could make a nested 2-tuple for tcp and udp, like
| {tcp, {RemoteIP ... }}. It's all just conceptual notation
| though, so there's tons of ways to do it (you'll see I differ
| in both names and ordering compared to the other commenters,
| but that's not actually significant either)
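The protocol-first tuple notation in toast0's comment can be sketched as tagged tuples used as dictionary keys. This is only a conceptual model, as the comment itself stresses; the class and field names (`InetFlow`, `UnixFlow`, `FlowTable`) are invented for illustration and do not correspond to any kernel structure.

```python
from typing import NamedTuple


class InetFlow(NamedTuple):
    proto: str  # "tcp" or "udp"
    remote_ip: str
    local_ip: str
    remote_port: int
    local_port: int


class UnixFlow(NamedTuple):
    proto: str  # always "unix"
    path: str


class FlowTable:
    """Maps a flow key of any arity to whoever owns that flow."""

    def __init__(self):
        self._owners = {}

    def register(self, key, owner):
        # A key may be registered once: an incoming packet must map to
        # exactly one owner.
        if key in self._owners:
            raise KeyError(f"flow already in use: {key}")
        self._owners[key] = owner

    def lookup(self, key):
        return self._owners.get(key)
```

With the protocol as the first element, a TCP flow and a UDP flow with otherwise identical addresses and ports are simply different keys, and protocols without ports can use keys of a different shape.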
| MuffinFlavored wrote:
| > {tcp, RemoteIP, LocalIP, RemotePort, LocalPort}
|
| Does the concept of Remote/Local IP have to be introduced when
| you discuss NAT?
| toast0 wrote:
| I would use Remote and Local for host networking first of
| all; rather than src/dest, because when you send you're
| the src, and when you receive, you're the dst... you
| don't want to include both permutations in the table
| (unless you're both the source and the destination, ie:
| connecting to yourself).
|
| For NAT, you need to have a way to calculate the 5-tuple
| for SideA when you have a 5-tuple from SideB, and vice
| versa; most often, that'll be a table lookup, either for
| the whole 5-tuple, or for 1:1 NAT, it could just be a
| lookup for the "Local" IP. In that case, maybe src and
| dest make more sense, and the NAT isn't really Local in
| my book.
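The table lookup toast0 describes for NAT can be sketched as a pair of dictionaries keyed on the whole 5-tuple. This is a toy model under stated assumptions, not how any particular firewall works; `NatTable` and its method names are hypothetical. Because the remote address is part of the key, the same public port can be reused for different remotes, which is why a 64k-per-public-IP ceiling is an implementation choice rather than a TCP limit.

```python
class NatTable:
    """Toy NAT that maps inside 5-tuples to outside 5-tuples and back."""

    def __init__(self, public_ip, ports=range(1024, 65536)):
        self.public_ip = public_ip
        self.ports = ports
        self.outbound = {}  # inside 5-tuple -> outside 5-tuple
        self.inbound = {}   # outside 5-tuple -> inside 5-tuple

    def translate_out(self, proto, inside_ip, inside_port, remote_ip, remote_port):
        inside = (proto, inside_ip, inside_port, remote_ip, remote_port)
        if inside in self.outbound:
            return self.outbound[inside]
        # A public port only needs to be unique per remote endpoint,
        # because the full outside 5-tuple is the lookup key.
        for port in self.ports:
            outside = (proto, self.public_ip, port, remote_ip, remote_port)
            if outside not in self.inbound:
                self.outbound[inside] = outside
                self.inbound[outside] = inside
                return outside
        raise RuntimeError("no free public port left for this remote")

    def translate_in(self, proto, public_port, remote_ip, remote_port):
        outside = (proto, self.public_ip, public_port, remote_ip, remote_port)
        return self.inbound.get(outside)
```

A box that keyed its table on the public port alone would run out after roughly 64k flows in total; keying on the full tuple moves that limit to roughly 64k flows per remote endpoint.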
| wyldfire wrote:
| Indeed - I fudged things a bit by talking about TCP there. It
| would have been clearer if I just discussed IP instead.
|
| > TCP and UDP just have completely separate spaces of socket
| addresses
|
| But so does SCTP, and ICMP and IGMP and ... -- so rather than
| enumerate the protocols we can just describe this property of
| IP.
| twic wrote:
| But the structure of those spaces can be different! The
| only structure IP imposes is that every packet has a source
| and destination address. It's up to each protocol whether
| it has port numbers (like TCP, UDP, and SCTP), or not (like
| ICMP and IGMP), or some other mechanism for identifying
| flows.
| derefr wrote:
| Yeah, but for ICMP and so forth, ports aren't a thing. So
| you don't really have a universal 5-tuple.
|
| For a router or other middlebox, or an OS kernel, to do
| things like outbound-initiated-flow firewall-rule
| exceptions correctly, it must keep N different flow-state
| tables, one per transport-layer (L4) protocol; where each
| flow-state table's "primary key" is over a set of columns
| unique to that table / L4 protocol.
|
| TCP and UDP just happen to be both the best-known L4
| protocols, and to both use {srcIP, srcPort, dstIP, dstPort}
| as their "primary key" for flows; but this doesn't hold for
| other L4 protocols.
|
| (Which is in turn why L4 protocols "must" be handled in
| kernel-land, for kernel firewalls, traffic-shapers, etc. to
| work: L4 flow-state doesn't have a universal schema for
| these services to work with; and because these services are
| implemented in static-compiled languages, they have to be
| built with compile-time knowledge of each known L4
| protocol, so that they can have concrete implementations
| for each L4 protocol written or generated for each service.
| There's no way to just bring in (through some hypothetical
| FUSE-like "userland L4 protocol server" abstraction) more
| L4 protocols, and expect those kernel facilities to work
| with them. [And all the same goes for ASICs in L4 network
| routers -- only moreso.] Which is why we got the L4
| protocol ossification we did. Modern protocols like SCTP
| and QUIC being implemented on top of UDP, is a direct
| result of there being no universal 5-tuple!)
| richardwhiuk wrote:
| You can handle L4 protocols in userspace. You can bind to
| a particular IP protocol number. You can even handle L3
| in userspace.
|
| Obviously if you do this you lose the ability for
| multiple applications to handle different "ports", unless
| you do the multiplexing in userspace as well.
| derefr wrote:
| You have a unique definition of "handle" that doesn't
| seem to include "your OS's kernel packet filter keeps
| working to pre-filter these packets based on an L4
| understanding of them before handing them to userspace,
| or after being handed them _by_ userspace."
|
| Which, if your machine is acting as something like a
| router/NAT/firewall, is kind of... the entire point of
| the box being there in the communication path.
| loeg wrote:
| I agree that is super important and maybe worth mentioning but
| the point of the quiz is to demonstrate that Linux's
| implementation is actually more constrained than the
| traditional "unique src addr/port dst addr/port 4-tuple" (for
| TCP).
| DSMan195276 wrote:
| Yeah I feel like people are missing the point that it's not a
| 4-tuple thing, it's an ordering issue. Since the source port
| (sometimes) gets picked before the interface or destination
| is, you can get an EADDRNOTAVAIL result even when there's
| technically a potential for it to end up as a unique 4-tuple.
| Doing the assignment in a different order or more explicitly
| can allow it to work by making sure that the kernel always
| knows it will be unique.
| [deleted]
| chatmasta wrote:
| I'm surprised the blog post does not mention Cloudflare's own
| library, tubular [0], the "BSD socket API on steroids":
|
| > The control plane for BPF socket lookup. Steers traffic that
| arrives via the tubes of the Internet to processes running on the
| machine. It's much more flexible than traditional BSD bind
| semantics:
|
| > * You can bind to all ports on an IP
|
| > * You can bind to a subnet instead of an IP
|
| > * You can bind to all ports on a subnet
|
| I played with it once and found it to be pretty awesome.
|
| [0] https://github.com/cloudflare/tubular
| majke wrote:
| Thanks for the reminder. This was a blog post after we hit the
| connectx() troubles :)
|
| Tubular/sk_lookup is about ingress.
|
| This blog post is about connected sockets on "egress".
|
| Actually, BPF sk_lookup could be used on egress, but it's not
| quite implemented yet.
| loeg wrote:
| Nice drgn shout-out 2/3 of the way down the page!
|
| https://drgn.readthedocs.io/en/latest/index.html
| waynesonfire wrote:
| Can someone TLDR this? interesting question but not worth reading
| all this to find out.
|
| Can two IPs bind to the same IP / port on the server? Even if
| they come from the same client, my intuition says no.
| toast0 wrote:
| I'm having a hard time understanding your question. But let me
| try. If none of these answer your question, you're going to
| need to provide some more concrete details.
|
| If your server has IPs 192.0.2.1 and 192.0.2.2, you can bind to
| 192.0.2.1:80 on one socket and 192.0.2.2:80 on another and
| listen for connections on both (or use it for outgoing
| connections, whatever). Or you could bind to 0.0.0.0:80 and
| listen for connections on either, as long as they don't have
| more specific bindings.
|
| If your client has IP 198.51.100.1, it can connect from tcp
| 198.51.100.1:18490 to 192.0.2.1:80 in one socket, and from tcp
| 198.51.100.1:18490 to 192.0.2.2:80 in another. You could also
| do the same on UDP, UDP's ports are orthogonal to TCP's.
|
| If you have another client with IP 198.51.100.2, it can also
| use local port 18490 to connect to both servers on port 80.
|
| A server can accept a virtually unlimited number of connections
| on a given tcp port, but a maximum of 64k per client IP, and
| your operating system will have some limit on the number of
| sockets; you can often raise that to some function of physical
| memory (FreeBSD limits you to a maximum of one fd for each four
| pages of physical memory); although, it takes a very specific
| load to be able to max out the FDs. Something like HAProxy in
| tcp forwarding mode can do it, but most other applications are
| going to use enough CPU and memory that you'll run out of those
| before you run out of FDs.
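The case in toast0's answer where one client address and port connects to two different servers can be tried on loopback. The sketch below assumes Linux behaviour, where two not-yet-connected sockets may bind the same local address if both set SO_REUSEADDR; other systems may reject the second bind. The helper names are made up, and two loopback listeners on different ports stand in for the two server IPs.

```python
import socket


def reusable_socket():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return sock


def loopback_listener():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    return srv


def share_local_address():
    """Connect two sockets from one local ip:port to two different remotes."""
    srv_a, srv_b = loopback_listener(), loopback_listener()

    first = reusable_socket()
    first.bind(("127.0.0.1", 0))  # kernel picks the shared local port
    local = first.getsockname()

    second = reusable_socket()
    second.bind(local)  # same local ip:port (assumed Linux semantics)

    # The 4-tuples differ in the remote port, so both connects can succeed.
    first.connect(srv_a.getsockname())
    second.connect(srv_b.getsockname())

    result = (first.getsockname(), second.getsockname(),
              first.getpeername(), second.getpeername())
    for sock in (first, second, srv_a, srv_b):
        sock.close()
    return result
```

Pointing both sockets at the same remote instead would produce a duplicate 4-tuple, which the kernel has to refuse.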
| jmholla wrote:
| I didn't really like the format of this article. The upfront quiz
| without much context or insight makes it feel like you're in
| class for an exam you didn't study for. The answers they provide
| are largely just Yes or No. When you get to the meat of the
| article, the quiz itself is forgotten until the conclusion. In
| fact, they even mention how useless the quiz is to provide
| insight at the beginning of the instructive content.
|
| > Is it all clear now? Well, probably no. It feels like reverse
| engineering a black box. So what is happening behind the scenes?
| Let's take a look.
|
| Keeping the quiz format, I think it would've been better to have
| it at the end then have the answers tie into the concepts you've
| learned. Or have the questions littered throughout the article
| and used to build upon the concepts it is teaching you. In its
| current state, I don't see this article's information having much
| staying power in my brain without concentrated study of the
| article.
| dan-robertson wrote:
| I enjoyed the quiz personally. I was kinda expecting more
| complicated questions, eg nat, iptables rules selecting local
| ip / interface, weird networking edge cases and so on.
| [deleted]
| AstixAndBelix wrote:
| This article is meant for tech people and the purpose of the
| quiz is to make you realize how little you know about the subject
| even as an expert. This is not a pop-sci YouTube video.
| ilyt wrote:
| And it fails at that. All the format does is annoy, and having
| to go through links to random GitHub gists is extra annoying if
| you didn't know the answer and just want to learn.
| dcow wrote:
| Same. I don't want to go run python code to play your little
| quiz game, sorry.
|
| I think the article would be better structured as an
| explanation of what a connection _is_, its 5-tuple, and then a
| dive into any surprising exceptions to the rule that there can
| be no ambiguity in where a packet is delivered, it must go to
| exactly one spot on a host. So as long as you can find an
| unambiguous subset of available 5-tuples, you're allowed to
| bind/connect.
| loeg wrote:
| The Python quiz code is easily read as identical to the
| equivalent syscalls. You can click through to the answers and
| read them out of the docstrings without ever running a Python
| interpreter (that's what I did).
| dcow wrote:
| It was 7am and I wasn't in the mood to hop back and forth
| between GH and the blog post trying a bunch of examples and
| piecing things together. I just wanted to read an
| interesting writeup. That's all.
| fierro wrote:
| agreed, the post is missing a clear explanation of what
| defines a TCP connection
| jkbs wrote:
| Fair point, though gauging what is the knowledge-level of
| your audience is hard. Gotta pick what to include and what
| to omit if you want to keep it concise.
| fierro wrote:
| for sure, gauging your audience is difficult. Still, any
| piece of writing benefits from a good framing. Great
| post, but I think jumping right into the details loses X%
| of potential readers. Keep up the good work :)
| jeffbee wrote:
| One of the best things about user-space TCP implementations is
| never needing to be an archaeologist, never trying to figure out
| the path-dependent sequence of accidents that leads to surprising
| and undocumented kernel behaviors.
| forgot_old_user wrote:
| But wouldn't the same issue apply to user-space TCP
| implementations too? User-space TCP implementations too could
| have "path-dependent sequence of accidents" which a power user
| might eventually need to figure out?
| jeffbee wrote:
| Yes but instead of being a 35-year-old accretion of mistakes,
| a user-space network stack is likely to be part of a more
| typical software lifecycle, that gets updated more easily and
| ultimately replaced. Also such things are dramatically easier
| to debug.
| loeg wrote:
| Happily Linux is at most a 32-year old accretion of
| mistakes (and I'm not completely confident 0.01 even had
| TCP).
| FPGAhacker wrote:
| What are arguments against user-space TCP?
| ilyt wrote:
| Normally, if an app opens a port that the firewall or
| application permissions don't allow, it will just get an error;
| with RAW sockets the kernel would need to parse the packet
| before deciding that.
|
| For example, normally you will get permission denied when you
| try to listen on a sub-1024 port as a normal user.
|
| I'd also imagine that if the kernel is doing any kind of
| connection tracking (so really anything with a firewall), it
| would be more optimal to _have_ that connection tracked in the
| kernel vs decoding it and adding it to the conntrack table.
|
| I guess some kind of half-RAW could be done in place, like
| say a socket where you define protocol and port but handle
| actual packets in userspace?
| jeffbee wrote:
| It doesn't work out-of-the-box with any existing software, so
| it only applies to things you are building from scratch, or
| that are structured in such a way that adding an alternate
| network mode is practical, or that would work with a dynamic
| library that shims the entire sockets API. Your user-mode stack
| won't be observable by any existing monitoring tools. Also, your
| process needs to execute with CAP_NET_RAW.
| chatmasta wrote:
| One argument is that it requires building on top of a raw
| socket, which can open you to all sorts of ancient
| vulnerabilities that have been patched in the battle-tested
| code running in the kernel, e.g. this recent ICMP remote code
| execution vulnerability [0] ("An attacker could send a low-
| level protocol error containing a fragmented IP packet inside
| another ICMP packet in its header to the target machine. To
| trigger the vulnerable code path, an application on the
| target must be bound to a raw socket") [1].
|
| [0] Discussion:
| https://old.reddit.com/r/netsec/comments/11s80zo/cve20232341...
|
| [1] Advisory:
| https://msrc.microsoft.com/update-guide/vulnerability/CVE-20...
___________________________________________________________________
(page generated 2023-03-20 23:01 UTC)