[HN Gopher] Cloudflare servers don't own IPs anymore so how do t...
       ___________________________________________________________________
        
       Cloudflare servers don't own IPs anymore so how do they connect to
       the internet?
        
       Author : jgrahamc
       Score  : 460 points
       Date   : 2022-11-25 14:16 UTC (1 day ago)
        
 (HTM) web link (blog.cloudflare.com)
 (TXT) w3m dump (blog.cloudflare.com)
        
       | benlivengood wrote:
       | It's strange/sad how 50% of their problem is caused by geofencing
       | which is caused by archaic laws.
       | 
       | 40% is caused by archaic protocols that don't allow enough
       | addresses or local efficient P2P or ISP caching of public content
       | (sort of what IPFS aimed to do), which would alleviate much of
       | the need for CDNs in the first place. The remaining 10% is just
       | solving hard problems.
       | 
       | I'm a little surprised that splitting by port number gives
       | servers enough connections; maybe they are connection-pooling
       | some of them between their egress and final destination. If there
       | was truly a 1:1 mapping of all user requests to TCP connections
       | then I'd expect there to be ~thousands of simultaneous
       | connections to, say, Walmart today (black Friday) which is also
       | probably on an anycast address, limiting the number of unique
       | src:dst ip and port tuples to 65K per cloudflare egress IP
       | address. Maybe that ends up being not so bad and DCs can scale by
       | adding new IPs? https://blog.cloudflare.com/how-to-stop-running-
       | out-of-ephem... covers a lot of the details in solving that and
       | similar problems.
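A back-of-the-envelope sketch of the tuple arithmetic in the comment above. This is illustrative only: the function name and the reserved-port assumption are not from the article, and real kernels cap things differently.

```python
# Rough arithmetic for the concern above: with one egress IP talking to
# one anycast destination IP and port (e.g. 443), the only free variable
# in the TCP 4-tuple is the source port.
EPHEMERAL_PORTS = 65_536 - 1_024   # usable source ports, assuming the
                                   # first 1,024 are reserved

def max_concurrent_connections(egress_ips: int, dest_endpoints: int) -> int:
    """Upper bound on simultaneous TCP connections, assuming a distinct
    (src ip, src port, dst ip, dst port) tuple per connection."""
    return egress_ips * EPHEMERAL_PORTS * dest_endpoints

# One egress IP to a single anycast IP:port caps out around 64K...
print(max_concurrent_connections(1, 1))    # 64512
# ...but each extra egress IP (or distinct destination endpoint)
# multiplies the bound, which is how a DC can scale by adding IPs.
print(max_concurrent_connections(4, 1))    # 258048
```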
        
       | danrl wrote:
       | As an industry, we are bad at deprecating old protocols like
       | IPv4. This is a genius hack for a problem we only have because
       | IPv6 is not yet adopted widely enough for serving legacy IP
       | users to become a droppable liability for the business. The ROI
       | is still high enough for us to "innovate" here. I applaud the
       | solution but mourn the fact that we still need this.
       | 
       | I guess ingress is next, then? Two layers of Unimog to achieve
       | stability before TCP/TLS termination maybe.
        
         | dopylitty wrote:
         | I've been thinking a lot about this in my own enterprise and
         | I've increasingly come to the conclusion that IP itself is the
         | wrong abstraction for how the majority of modern networked
         | compute works. IPv6, as a (quite old itself) iteration on top
         | of IPv4 with a bunch of byzantine processes and acronyms tacked
         | on is solving the wrong problem.
         | 
         | Originally IP was a way to allow discrete physical computers in
         | different locations owned by different organizations to find
         | each other and exchange information autonomously.
         | 
         | These days most compute actually doesn't look like that. All my
         | compute is in AWS. Rather than being autonomous it is
         | controlled by a single global control plane and uniquely
         | identified within that control plane.
         | 
         | So when I want my services to connect to each-other within AWS
         | why am I still dealing with these complex routing algorithms
         | and obtuse numbering schemes?
         | 
         | AWS knows exactly which physical hosts my processes are running
         | on and could at a control plane level connect them directly.
         | And I, as someone running a business, could focus on the higher
         | level problem of 'service X is allowed to connect to service Y'
         | rather than figuring out how to send IP packets across
         | subnets/TGWs and where to configure which ports in NACLs and
         | security groups to allow the connection.
         | 
         | Similarly my ISP knows exactly where Amazon and CloudFlare's
         | nearest front doors are so instead of 15 hops and DNS
         | resolutions my laptop could just make a request to Service X on
         | AWS. My ISP could drop the message in AWS' nearest front door
         | and AWS could figure out how to drop the message on the right
         | host however they want to.
         | 
         | I know there's a lot of legacy cruft and also that there are
         | benefits of the autonomous/decentralized model vs central
         | control for the internet as a whole but given the centralized
         | reality we're in, especially within the enterprise, I think
         | it's worth reevaluating how we approach networking and whether
         | the continuing focus on IP is the best use of our time.
        
           | ec109685 wrote:
           | The IP addresses you see as an AWS customer aren't the same
           | used to route packets between hosts. That said, there's a
           | huge amount of commodity infrastructure built up that
           | understands IP addresses and routing layers, so unless a new
           | scheme offers tremendous benefits, it won't get adoption.
           | 
           | At least from a security perspective, though, IP ACLs are
           | giving way to service-based identities, which is a good
           | thing.
           | 
           | You can see how AWS internally does networking here:
           | https://m.youtube.com/watch?v=ii5XWpcYYnI
        
           | wpietri wrote:
           | > my laptop could just make a request to Service X on AWS
           | 
           | I was looking for the "just" that handwaves away the
           | complexity and I was not disappointed.
           | 
           | How do you imagine your laptop expressing a request in a way
           | that it makes it through to the right machine? Doing a
           | traceroute to amazon.com, I count 26 devices between me and
           | it. How will those devices know which physical connection to
           | pass the request over? Remember that some of them will be
           | handling absurd amounts of traffic, so your scheme will need
           | to work with custom silicon for routing as well as doing ok
           | on the $40 Linksys home unit. What are you imagining that
           | would be so much more efficient that it's worth the enormous
           | switching costs?
           | 
           | I also have questions about your notion of "centralization".
           | Are you saying that Google, Microsoft, and other cloud
           | vendors should just... give up and hand their business to
           | AWS? Is that also true for anybody who does hosting,
           | including me running a server at home? If so, I invite you to
           | read up on the history of antitrust law, as there are good
           | reasons to avoid a small number of people having total
           | control over key economic sectors.
        
             | dopylitty wrote:
             | > How do you imagine your laptop expressing a request in a
             | way that it makes it through to the right machine? Doing a
             | traceroute to amazon.com, I count 26 devices between me and
             | it. How will those devices know which physical connection
             | to pass the request over?
             | 
             | That's my whole point. You're thinking of it from an IP
             | perspective where there are individual devices in some
             | chain and they all need to autonomously figure out a path
             | from my laptop to AWS. The reality is every device between
             | me and AWS is owned by my ISP. They know exactly which
             | physical path ahead of time will get a message from my
             | laptop to AWS. So why waste all the time on the IP
             | abstraction?
             | 
             | > I also have questions about your notion of
             | "centralization". Are you saying that Google, Microsoft,
             | and other cloud vendors should just... give up and hand
             | their business to AWS?
             | 
             | AWS is just an example. Realistically a huge amount of
             | traffic on the internet is going to 6 places and my ISP
             | already has direct physical connections to those places.
             | Maintaining this complex and byzantine abstraction to
             | figure out how to get a message from my laptop to compute
             | in those companies' infrastructure should not be necessary.
             | 
             | And in general the more important part is within AWS' (or
             | Microsoft's or enterprise X's) network why waste time on IP
             | when the network owner knows exactly which host every
             | compute process is running on?
             | 
             | Instead of thinking of an enterprise network as a set of
             | autonomous hosts that need to figure out a path between
             | each other think of it as a set of processes running on the
             | same OS (the virtual infrastructure). Linux doesn't need to
             | do BGP to figure out how to connect two processes so why
             | does your network?
        
               | error503 wrote:
               | > That's my whole point. You're thinking of it from an IP
               | perspective where there are individual devices in some
               | chain and they all need to autonomously figure out a path
               | from my laptop to AWS. The reality is every device
               | between me and AWS is owned by my ISP. They know exactly
               | which physical path ahead of time will get a message from
               | my laptop to AWS. So why waste all the time on the IP
               | abstraction?
               | 
               | IP is concrete, not abstract. Whatever form the network
               | takes, when you make a request, your ISP is going to
               | have to make a decision on how to route it over their
               | physical assets to get it to the desired destination.
               | Unless you are talking about your ISP provisioning a
               | physical circuit directly between you and Amazon, with no
               | multiplexing and no equipment on it, you are going to
               | have those hops whether you use IP to choose the route or
               | not. That is not really negotiable, or you're not
               | describing a network (or something even remotely viable)
               | at all. Maybe that path is invisible to you, but it
               | exists.
               | 
               | And in fact, in many or even most carrier networks, this
               | is abstracted in much the way that you describe _within
               | that particular network_ using MPLS. But this approach
               | doesn't scale to the scope of the Internet, requires all
               | edge devices to have complete knowledge of every necessary
               | path in the network, and makes inter-networking more
               | difficult because every endpoint and its end-to-end path
               | to every other needs to be shared and synchronized. This
               | is actually more complex, and much more brittle, than the
               | current implementation. And for what? I still have yet to
               | understand what advantage you think would be gained here.
               | Right now, if you want to talk to s3, you send a packet
               | to s3, and your ISP does all the 'complex and byzantine'
               | work. What do you care how they do it?
               | 
               | Ignoring the long tail is also silly. FAANG might
               | represent a majority of traffic on the Internet, but the
               | long tail is huge, and you can't just hand-wave it away
               | like that. Enabling it is what makes the Internet what it
               | is, and if your proposal doesn't account for it, it's
               | dead in the water.
               | 
               | > And in general the more important part is within AWS'
               | (or Microsoft's or enterprise X's) network why waste time
               | on IP when the network owner knows exactly which host
               | every compute process is running on?
               | 
               | Knowing that is the easy part. You still need to figure
               | out a path and actually route the packets along it. You
               | still need to deal with path selection, load balancing,
               | fault tolerance, and synchronizing any necessary state.
               | You still need devices along the path to know what to do
               | with a packet they receive, somehow. It turns out that
               | hop-by-hop routing is an efficient and viable way to
               | accomplish this.
               | 
               | > Instead of thinking of an enterprise network as a set
               | of autonomous hosts that need to figure out a path
               | between each other think of it as a set of processes
               | running on the same OS (the virtual infrastructure).
               | Linux doesn't need to do BGP to figure out how to connect
               | two processes so why does your network?
               | 
               | Because the 'network' you describe is not a network? It's
               | processes running on the same machine? This is not
               | analogous at all to a large distributed network like the
               | Internet.
        
               | scarmig wrote:
               | > The reality is every device between me and AWS is owned
               | by my ISP. They know exactly which physical path ahead of
               | time will get a message from my laptop to AWS.
               | 
               | Neither of these are true in general. And suppose AWS (or
               | GCP, or Azure, or Cloudflare...) decides to add a new
               | POP. How do they broadcast to your ISP and all the other
               | ISPs in the world how exactly to send datagrams to it?
        
               | nextaccountic wrote:
               | > That's my whole point. You're thinking of it from an IP
               | perspective where there are individual devices in some
               | chain and they all need to autonomously figure out a path
               | from my laptop to AWS. The reality is every device
               | between me and AWS is owned by my ISP. They know exactly
               | which physical path ahead of time will get a message from
               | my laptop to AWS. So why waste all the time on the IP
               | abstraction?
               | 
               | The ISP can internally use MPLS to do routing exactly the
               | way you suggest: build a "circuit" between you and Amazon
               | and then route the packets internally in their networks
               | through this circuit, instead of using IP. This works
               | because the ISP has a global view of their network and as
               | such their routers don't need to work independently. MPLS
               | was needed in a time where IP routing was too slow, but
               | nowadays you can get full speed without MPLS.
               | 
               | But anyway, it doesn't matter how packets are routed
               | internally within each network, IP is still required for
               | routing between different networks. Which is actually
               | super common!
        
               | [deleted]
        
               | wpietri wrote:
               | Sorry, this still sounds like architecture astronautics
               | to me. You have an intuition that things could maybe be
               | simpler, which is not a bad place to start, but then you
               | have to ground it in the actual reality.
               | 
               | > You're thinking of it from an IP perspective where
               | there are individual devices in some chain
               | 
               | What do you plan to replace the individual devices with?
               | 
               | Looking at the traceroute in question, something I'd
               | suggest you do for your own, I see one router owned by
               | me, 6 by my ISP, 2 at an interchange, 5 for a backbone, a
               | number of intermediate ones of mysterious ownership, and
               | finally one owned by Amazon, presumably with a bunch of
               | other Amazon hops that are hidden from me.
               | 
               | These are physical devices connected by physical links
               | (wires, cables, fiber). Wires that people installed.
               | Wires that break. Connected to devices that break. What
               | in your proposed grand vision will happen there? Please
               | start with the happy path and then give detail on the
               | failure cases.
               | 
               | A similar problem applies in Amazon. The abstraction they
               | hand you is pretty convenient. But that abstraction is
               | made out of many millions of devices connected up in
               | cunning ways.
               | 
               | Linux can connect two processes because the kernel has
               | total control over a modest number of CPUs and RAM. That
               | just doesn't compare to literal billions of
               | interconnected devices with no central control. It's like
               | saying that because your dad knows your mom's name and
               | how to reach her, he should be able to do the same thing
               | for everybody in his country.
        
           | error503 wrote:
           | > So when I want my services to connect to each-other within
           | AWS why am I still dealing with these complex routing
           | algorithms and obtuse numbering schemes?
           | 
           | > AWS knows exactly which physical hosts my processes are
           | running on and could at a control plane level connect them
           | directly. And I, as someone running a business, could focus
           | on the higher level problem of 'service X is allowed to
           | connect to service Y' rather than figuring out how to send IP
           | packets across subnets/TGWs and where to configure which
           | ports in NACLs and security groups to allow the connection.
           | 
           | You shouldn't be? Doesn't AWS number your machines for you
           | automatically and give you a unique ID you can use with DNS
           | to reach it? And also provide a variety of 'ingress' services
           | to abstract load balancing and security as well? I'm not a
           | consumer of AWS services in my dayjob, but isn't this their
           | entire raison d'etre? Otherwise you may as well just run much
           | cheaper VMs elsewhere.
           | 
           | > Similarly my ISP knows exactly where Amazon and
           | CloudFlare's nearest front doors are so instead of 15 hops
           | and DNS resolutions my laptop could just make a request to
           | Service X on AWS. My ISP could drop the message in AWS'
           | nearest front door and AWS could figure out how to drop the
           | message on the right host however they want to.
           | 
           | Uhm, aside from handwaving away how your ISP is going to give
           | you a direct, no-hops connection to AWS, this is pretty much
           | exactly what your ISP is doing. Hell, in some cases, your ISP
           | has abstracted the underlying backbone hops too using
           | something like MPLS, and this is completely invisible to you
           | as an end user. You or your laptop don't have to think about
           | the network part of things at all. You ask to connect to s3,
           | your laptop looks up the service's IP address (unique ID) in
           | DNS, sends some packets, and your ISP routes them to
           | CloudFlare's nearest front doors.
           | 
           | There are some good arguments to be made for a message-
           | passing focused rather than connection focused protocol
           | model, but that doesn't seem to be what you're talking about.
           | What you seem to be talking about is doing away with routing
           | altogether, and even in a relatively centralized internet,
           | that just makes zero sense. We will continue to need the
           | aggregation layer, we will continue to have multiple routes
           | to a resource through multiple hops that need to be resolved
           | into a path, and we'll continue to need a way to uniquely
           | identify a service endpoint.
        
           | akira2501 wrote:
           | > Rather than being autonomous it is controlled by a single
           | global control plane and uniquely identified within that
           | control plane.
           | 
           | By default, sure. You can easily bring your own IPs into AWS
           | and use them instead, and I don't think it's hard to imagine
           | the pertinent use cases and risk management this brings.
        
         | otabdeveloper4 wrote:
         | You misunderstand what "the internet" is. "The internet" is a
         | process and methodology for connectivity between LANs. And 48
         | addressing bits is more than enough to solve this problem.
         | 
         | TCP/IP is full of cruft and causes a shitload of unnecessary
         | problems, but increasing the number of bits in the addressing
         | space solves literally none of them.
        
       | mike256 wrote:
       | Wouldn't it be better when all those big CDNs just switch off
       | IPv4 and force the sleeping ISPs to enable IPv6? Maybe we should
       | introduce some IPv6 only days as a first step...
        
       | subarctic wrote:
       | Pretty interesting article. TLDR: they're now using anycast for
       | egress, not just ingress.
       | 
       | Each data center has a single IP for each country code (so that
       | they can make outgoing requests that are geolocated in any
       | country). In order to achieve that, they have a /24 or larger
       | range for each country, and announce it from all their data
       | centers, and then they route the traffic over their backbone to
       | the appropriate data center for that IP.
       | 
       | Then in the data center, they share the single IP across all
       | their servers by giving each server a range of TCP/UDP port space
       | (instead of doing stateful NAT).
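The port-slicing scheme summarized above can be sketched as follows. The helper names are hypothetical, and the usable port range is an assumption; only the 2,048-port slice size comes from the article (note that 31 servers per IP falls out of the arithmetic, matching the blog's figure).

```python
# Hypothetical sketch of the port-slicing scheme: one shared egress IP,
# each server statically owns a contiguous slice of the port space, so
# no stateful NAT table is needed to route return packets.
PORT_MIN, PORT_MAX = 2_048, 65_535   # assumed usable egress port range
SLICE = 2_048                        # slice size mentioned in the article

def port_slice(server_index: int) -> range:
    """Return the egress ports statically assigned to one server."""
    lo = PORT_MIN + server_index * SLICE
    if lo > PORT_MAX:
        raise ValueError("no ports left for this server")
    return range(lo, min(lo + SLICE, PORT_MAX + 1))

def server_for_port(port: int) -> int:
    """Reverse lookup a router can do statelessly on a return packet."""
    return (port - PORT_MIN) // SLICE

assert server_for_port(port_slice(7)[0]) == 7
print(port_slice(0))   # range(2048, 4096)
# (65536 - 2048) / 2048 slices -> one IP shared among 31 servers
```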
        
         | ec109685 wrote:
         | It's not a single IP address per data center. Otherwise they'd
         | only be able to make 64k simultaneous egress connections,
         | and their scheme of different IP addresses per "geo" and
         | product wouldn't work.
         | 
         | Edit: based on this part of blog post:
         | 
         | "With a port slice of say 2,048 ports, we can share one IP
         | among 31 servers. However, there is always a possibility of
         | running out of ports. To address this, we've worked hard to be
         | able to reuse the egress ports efficiently."
        
           | icehawk wrote:
           | Do you mean 64k simultaneous egress connections per origin
           | server IP? That's still a reasonable number of connections.
        
             | ec109685 wrote:
             | I might not be understanding, but it needs to know how to
             | route the packets back to the server and it does that with
             | ports.
        
           | error503 wrote:
           | I don't know if their stack allows for it, but the number of
           | concurrent connections is limited by unique 5-tuples (which
           | is what the host uses to identify the sockets), not just
           | source ports. There is presumably quite a lot of entropy
           | available in destination IP & port to enable far more than
           | 2048 connections per server. But it's hard to be
           | deterministic about how much, so they might be choosing to
           | ignore this and use the lower bound.
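The 5-tuple point above can be illustrated with a toy example (addresses from documentation ranges; illustrative only):

```python
# Minimal illustration: the kernel keys sockets on the full 5-tuple, so
# the same local port can carry many concurrent connections as long as
# the remote endpoint differs.
conns = set()
local = ("198.51.100.1", 4096)   # one port from a server's 2,048-port slice
for i in range(1, 4):            # three distinct destination IPs
    for dst_port in (80, 443):   # two distinct destination ports
        conns.add(("tcp", *local, f"203.0.113.{i}", dst_port))

print(len(conns))  # 6 distinct connections sharing one local port
```

So a 2,048-port slice bounds concurrency per destination endpoint, not per server overall, which is why the lower bound in the blog is conservative.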
        
           | grogers wrote:
           | I guess for large data centers they could use multiple IPs
           | and the overall scheme stays essentially the same, but the
           | article seems to strongly imply that it is one single egress
           | IP per DC. The 64k simultaneous connections limitation is
           | only per origin server IP.
        
         | 0xbkt wrote:
         | > TLDR: they're now using anycast for egress, not just ingress.
         | 
         | I don't see this behavior in my setup. I have a server in AMS
         | and connect from IST, but `netstat` reveals a unicast IP
         | address that is routed back to IST.
        
       | Terretta wrote:
       | I quite like what CloudFlare has done here.
       | 
       | There's a fourth way to resolve this, that works for the core use
       | case, is less engineering, and was in production 20 years ago,
       | but I can't fit it within the margins of this comment box.
       | 
       | // CF's approach has additional feature advantages though.
        
       | [deleted]
        
       | dfawcus wrote:
       | What they describe sounds a lot like a distributed static RSIP
       | scheme.
       | 
       | https://en.wikipedia.org/wiki/Realm-Specific_IP
       | 
       | With port ranges allocated statically per server within a
       | locale, rather than being 'leased'.
       | 
       | So the IP routes to the locale, and the port range is the
       | static RSIP mapping to the server within that locale.
        
       | martinohansen wrote:
       | Am I missing something here or did they just reinvent a NAT
       | gateway with static rules?
       | 
       | I understand that they started using anycast for the egress IPs
       | as well, but thats unrelated to the NAT problem.
        
         | [deleted]
        
       | xg15 wrote:
       | > _However, while anycast works well in the ingress direction, it
       | can 't operate on egress. Establishing an outgoing connection
       | from an anycast IP won't work. Consider the response packet. It's
       | likely to be routed back to a wrong place - a data center
       | geographically closest to the sender, not necessarily the source
       | data center!_
       | 
       | Slightly OT question, but why wouldn't this be a problem with
       | ingress, too?
       | 
       | E.g. suppose I want to send a request to https://1.2.3.4. What I
       | don't know is that 1.2.3.4 is an anycast address.
       | 
       | So my client sends a SYN packet to 1.2.3.4:443 to open the
       | connection. The packet is routed to data center #1. The data
       | center duly replies with a SYN/ACK packet, which my client
       | answers with an ACK packet.
       | 
       | However, due to some bad luck, the ACK packet is routed to data
       | center #2 which is also a destination for the anycast address.
       | 
       | Of course, data center #2 doesn't know anything about my
       | connection, so it just drops the ACK or replies with a RST. In
       | the best case, I can eventually resend my ACK and reach the right
       | data center (with multi-second delay), in the worst case, the
       | connection setup will fail.
       | 
       | Why does this not happen on ingress, but is a problem for egress?
       | 
       | Even if the handshake uses SYN cookies and got through on data
       | center #2, what would keep subsequent packets that I send on that
       | connection from being routed to random data centers that don't
       | know anything about the connection?
        
         | error503 wrote:
         | In addition to what others have said about route stability,
         | there is possibly a more relevant point to make: for 'ingress'
         | traffic, the anycast IP is the _destination_, and for 'egress'
         | traffic it is the _source_. This distinction is important
         | because the system doesn't have any reasonable way to
         | anticipate where the reply traffic to the anycast IP will
         | actually route to. A packet is launched to the Internet from an
         | anycast IP, and the replies could come to anywhere that IP is
         | announced - most likely not the same place the connection
         | originated. Contrast to when the (unicast) client makes the
         | first move, and the anycast node that gets the initial packets
         | is highly likely to get any follow up packets as well, so in
         | the vast majority of cases, this works just fine.
        
         | matsur wrote:
         | This is a problem in theory. In practice (and through
         | experience) we see very little routing instability in the way
         | you describe.
        
           | xg15 wrote:
           | You mean, it's just luck?
        
             | Brian_K_White wrote:
             | right? also seems like load should or at least could be
             | changing all the time. are geo or hop proximity really the
             | only things that decide a route? not load also?
             | 
             | But although I would be surprised if load were not also
             | part of the route picker, I would also be surprised if the
             | routers didn't have some association or state tracking to
             | actively ensure related packets get the same route.
             | 
             | But I guess this is saying exactly that, that it's relying
             | on luck and happenstance.
             | 
             | It may be doing the job well enough that not enough people
             | complain, but I wouldn't be proud of it myself.
        
               | remram wrote:
               | Anycast is implemented by BGP and doesn't take load into
               | account in any way. You will reach the closest location
               | announcing that address (well, prefix).
        
               | ignoramous wrote:
               | TFA claims that _Anycast_ is an advantage when dealing
               | with DDoS because it helps spread the load? A regional
               | DDoS (where it consistently hits a small set of DCs) is
               | not a common scenario, I guess?
        
               | csande17 wrote:
               | Basically yes. Large-scale DDoS attacks rely on
               | compromising random servers and devices, either directly
               | with malware or indirectly with reflection attacks. Those
               | hosts aren't all going to be located in the same place.
               | 
               | An attacker could choose to only compromise devices
               | located near a particular data center, but that would
               | really reduce the amount of traffic they could generate,
               | and also other data centers would stay online and serve
               | requests from users in other places.
        
               | toast0 wrote:
               | Your intuition is more or less all wrong here, sorry.
               | 
               | Most routers with multiple viable paths pass way too much
               | traffic to do state tracking of individual flows. Most
               | typically, the default metric is BGP path length: for a
               | given destination, send packets through the route that
               | has the most specific prefix; if there's a tie, use the
               | route that transits the fewest networks to get there; if
               | there's still a tie, use the route that has been up the
               | longest (which maybe counts as state tracking). Routing
               | like this doesn't take into account any sort of load
               | metric, although people managing the routers might do
               | traffic engineering to try to avoid overloaded routes
               | (but it's difficult to see what's overloaded a few hops
               | beyond your own router).
               | 
               | For the most part, an anycast operation is going to work
                | best if all sites can handle all the foreseeable load,
               | because it's easy to move all the traffic, but it's not
               | easy to only move some. Everything you can do to try to
               | move some traffic is likely to either not be effective or
               | move too much.
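The selection order described above can be sketched as a comparison key. This is a toy illustration only (the route fields are invented, and real BGP has many more tie-breakers than these three):

```python
from dataclasses import dataclass

@dataclass
class Route:
    prefix_len: int   # longer prefix = more specific
    as_path_len: int  # networks transited to reach the destination
    uptime: float     # seconds this route has been up

def best_route(routes):
    """Pick a route: most specific prefix first, then shortest AS
    path, then the route that has been up the longest."""
    return max(routes,
               key=lambda r: (r.prefix_len, -r.as_path_len, r.uptime))
```

Note there is no load metric anywhere in the key, which is the point being made above.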
        
               | richieartoul wrote:
               | Why shouldn't they be proud of a massive system like
               | Cloudflare that works extremely well? As a commentor
                | Cloudflare that works extremely well? As a commenter
                | below described, it's not luck or happenstance, it's a
               | natural consequence of how BGP works. Seems pretty
               | elegant to me.
        
             | rizky05 wrote:
             | calculated luck
        
             | [deleted]
        
         | tonyb wrote:
         | It works because the route to 1.2.3.4 is relatively stable. The
         | routes would only change and end up at data center #2 if data
         | center #1 stopped announcing the routes. In that case the
         | connection would just re-negotiate to data center #2.
        
           | xg15 wrote:
           | Ah, ok, that makes sense. So for a given point of origin,
           | anycast generally routes to the same server?
        
             | majke wrote:
             | Correct. From a single place, you're likely to BGP-reach
             | one Cloudflare location, and it doesn't change often.
        
         | ratorx wrote:
         | As others have mentioned, this is not often a problem because
         | routing is normally fairly stable (at least compared to the
         | lifetime of a typical connection). For longer lived connections
         | (e.g. video uploads), it's more of a problem.
         | 
         | Also, there are a fair number of ASes that attempt to load
         | balance traffic between multiple peering points, without
         | hashing (or only using the src/dst address and not the port).
         | This will also cause the problem you described.
         | 
         | In practice it's possible to handle this by keeping track of
         | where the connections for an IP address typically ingress and
         | sending packets there instead of handling them locally. Again,
         | since it's a few ASes that cause problems for typical
         | connections, is also possible to figure out which IP prefixes
         | experience the most instability and only turn on this overlay
         | for them.
        
         | grogers wrote:
         | Yep, it can happen that your packet gets routed to a different
         | DC from a prior packet. But the routers in between the client
         | and the anycast destination will do the same thing if the
         | environment is the same. So to get routed to a new location,
         | you would usually need either:
         | 
         | * A new (usually closer) DC comes online. That will probably be
         | your destination from now on.
         | 
         | * The prior DC (or a critical link on the path to it) goes
         | down.
         | 
         | The crucial thing is that the client will typically be routed
         | to the closest destination to it. In the egress case the
         | current DC may not be the closest DC to the server it is trying
         | to reach so the return traffic would go to the wrong place.
         | This system of identifying a server with unique IP/port(s)
         | means that CF's network can forward the return traffic to the
         | correct place.
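That last point can be illustrated with a toy lookup table, assuming a hypothetical mapping from each port slice of a shared egress IP to the data center that owns it (the addresses, slice width, and DC names are all invented):

```python
PORTS_PER_SLICE = 2048  # illustrative slice width

# Hypothetical ownership table: (egress IP, slice index) -> data center
SLICE_OWNER = {
    ("198.51.100.1", 0): "dc-ams",
    ("198.51.100.1", 1): "dc-fra",
}

def forward_target(dst_ip, dst_port):
    """Where an anycast site should forward a return packet it
    received for a connection that egressed from another DC."""
    return SLICE_OWNER[(dst_ip, dst_port // PORTS_PER_SLICE)]
```

Because the lookup depends only on fields in the packet header, any site that receives the return traffic can forward it correctly without shared connection state.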
        
           | 9935c101ab17a66 wrote:
           | Dang this stuff is super interesting. As someone who knows
           | very little about networking and DNS, both your comment and
            | the parent comment are incredibly insightful and
           | illuminating! Thanks for sharing.
        
         | ignoramous wrote:
         | Yes, as others have mentioned, route flapping is a problem.
         | But, in practice, not as big a problem as DNS-based routing.
         | 
         | - See: https://news.ycombinator.com/item?id=10636547
         | 
         | - And: https://news.ycombinator.com/item?id=17904663
         | 
         | Besides, SCTP / QUIC aware load balancers (or proxies) are
         | detached from IPs and should continue to hum along just fine
         | regardless of which server IP the packet ends up at.
        
       | Thorentis wrote:
       | The fact that we haven't yet adopted IPv6 tells me that IPv6
       | isn't actually that great of a solution. We need an Internet
       | Protocol that solves modern problems and that has a good
       | migration path.
        
         | wpietri wrote:
         | 40% of Google's traffic comes via IPv6. Up from 1% a decade
         | ago. https://www.google.com/intl/en/ipv6/statistics.html
         | 
         | If you think you can do better than that, I look forward to
         | hearing your plan. Personally, I think that's huge progress.
        
         | eastdakota wrote:
         | Fun fact: the first product we announced to celebrate
          | Cloudflare's launch day anniversary was an IPv4<->IPv6 gateway:
         | 
         | https://blog.cloudflare.com/introducing-cloudflares-automati...
         | 
         | The success of that convinced us we should do something to
         | improve the Internet every year to celebrate our "birthday."
          | Over time we ended up with more than one product that met
          | those criteria and timing, so it went from a day of
          | celebration to a week. That became our Birthday Week. Then we
          | saw how well bundling a set of announcements into a week
          | worked, so we decided to do it at other times of the year.
          | And that's how Cloudflare
         | Innovation Weeks got started, explicitly with us delivering
         | IPv6 support back in 2011.
        
         | growse wrote:
         | You need an IPv4 src address to connect out to an IPv4 origin.
        
         | zekica wrote:
         | Where do they say that they haven't adopted IPv6? All their
         | offerings support IPv6.
        
       | inopinatus wrote:
       | TLDR: Cloudflare is using five bits from the port number as a
       | subnetting & routing scheme, with optional content policy
       | semantics, for hosts behind anycast addressing and inside their
       | network boundary.
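Taking the parent's five-bit reading at face value, the arithmetic works out as follows. This is a back-of-the-envelope sketch, not Cloudflare's actual layout:

```python
SLICE_BITS = 5
SLICES = 1 << SLICE_BITS            # 32 slices per shared egress IP
PORTS_PER_SLICE = 65536 // SLICES   # 2048 ports per slice

def slice_for_port(port):
    """Map a port on the shared egress address back to its slice,
    i.e. to the host that owns that port range."""
    return port // PORTS_PER_SLICE
```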
        
         | huggingmouth wrote:
         | By far the best summary I've come across so far. Thank you!
        
       | pm2222 wrote:
        | Packet headers essentially contain info that labels a packet. In
        | the v4 world they are running out of bits, so they are
        | maneuvering; in my view that's cool yet at the same time boring
        | to me.
       | 
       | Perhaps they should go v6 internally and implement one v4-v6
       | gateway to put all these tricks behind.
        
       | Ptchd wrote:
       | If you don't need an IP to be connected to the internet, sign me
       | up... I think they are full of it though... Even if you only have
       | one IP.... you still have an IP
       | 
       | > PING cloudflare.com (104.16.133.229) 56(84) bytes of data.
       | 
       | > 64 bytes from 104.16.133.229 (104.16.133.229): icmp_seq=1
       | ttl=52 time=10.6 ms
       | 
       | With a ping like this, you know that I am not using Musk's
       | Internet....
        
       | cesarb wrote:
       | All this wonderful complexity, just because a few servers insist
       | on behaving as if the location of the IP address and the location
       | of the user should always match.
        
       | Rasbora wrote:
       | Whenever I see the name Marek Majkowski come up, I know the blog
       | post is going to be good.
       | 
       | I had to solve this exact problem a year ago when attempting to
       | build an anycast forward proxy, quickly came to the conclusion
       | that it'd be impossible without a massive infrastructure
       | presence. Ironically I was using CF connections to debug how they
       | might go about this problem, when I realized they were just using
       | local unicast routes for egress traffic I stopped digging any
       | deeper.
       | 
       | Maintaining a routing table in unimog to forward lopsided egress
       | connections to the correct DC is brilliant and shows what is
       | possible when you have a global network to play with, however I
       | wonder if this opens up an attack vector where previously
       | distributed connections are now being forwarded & centralized at
       | a single DC, especially if they are all destined for the same
       | port slice...
        
       | soumendrak wrote:
       | Cloudflare and Akamai: competition
        
       | mcint wrote:
       | Given the discussion under
       | https://news.ycombinator.com/item?id=33743935 about opening a
       | connection to 1.2.3.4:443 (I'm prompted by the same curiosity
       | about ingress load-balancing, _statelessly_ )...
       | 
       | How does the ingress "router" load-balance incoming connections,
       | which it must (even if the "router" is a default host or
       | cluster)? CF isn't opening TCP, then HTTP just to send redirects
       | to another IP for the same cluster.
       | 
        | I guess hashing on IP and port is already readily used in
        | routing decisions, so the 4-tuple of the inbound packets
        | (CF-addr [fixed, shared], CF-port [443], eyeball-addr,
        | eyeball-port) provides a consistent mapping to a server.
       | 
       | I guess that this is good for 99.9 percent of _connections_ ,
       | which are short-lived, and opportunistically kept open or reused.
       | I suppose other long-lived connections might take place in the
       | context of an application that tracks data above and outside of
       | TCP-alone. I'm grasping for a missing middle, in size of use
       | case, and can't quickly name things that people might proxy but
       | need stable connections. CloudFlare's reverse proxying to web
       | servers would count, if the web-fronting had to traverse someone
       | else's proxy layer.
       | 
        | What are the rough edges here? What are the next challenges
        | here to build around?
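The stateless hashing guessed at above might look like this sketch. The backend names are made up, and real routers use ECMP hardware hashing (e.g. CRC or Toeplitz over header fields) rather than SHA-256; the idea is the same:

```python
import hashlib

BACKENDS = ["srv-a", "srv-b", "srv-c"]  # hypothetical server pool

def pick_backend(cf_addr, cf_port, eyeball_addr, eyeball_port):
    """Statelessly map a TCP 4-tuple to a backend; the same tuple
    always lands on the same server, no connection table needed."""
    key = f"{cf_addr}:{cf_port}|{eyeball_addr}:{eyeball_port}".encode()
    digest = hashlib.sha256(key).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]
```

As long as the pool membership doesn't change, no redirects are needed: every packet of a connection hashes to the same place.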
        
       | jesuspiece wrote:
        
       | ronnier wrote:
       | Spammers are exploiting cloudflare by creating thousands of new
       | domains on the free tld (like .ml) and hosting the sites behind
       | cloudflare and spamming social media apps with links to scam
       | dating sites. CPA scammers.
       | 
       | If anyone from CF sees this, I can work with you and give you
       | data on this. I'm dealing with this at one of the large social
       | media companies.
       | 
       | Here's an example, this is NSFW - https://atragcara.ga
        
         | elorant wrote:
         | So why aren't social media platforms blocking the domains?
        
           | ronnier wrote:
           | We do. But with free TLD's, spammers and scammers can create
           | an unlimited number of new domains at zero cost. That's the
           | problem. They can send a single spam URL to a single person
           | and scale that out, each person gets a unique domain and URL.
        
             | elorant wrote:
             | So how about blocking the users then? Or limit their
             | ability to post links.
        
               | ronnier wrote:
               | That's done too. But it's not just a few, it's literally
               | 10s of thousands of individuals from places like
               | Bangladesh who do this as their source of income. They
               | are smart, on real devices, will solve any puzzle you
               | throw at them, and will adapt to any blocks or locking.
               | It's not an easy problem to solve which is why no
                | platform has solved it (oddly, spam is pretty much
                | nonexistent on HN)
        
               | elorant wrote:
               | I don't think there's any benefit in spamming HN. There
               | aren't that many users in here, and it could lead to a
                | backlash, considering the technical expertise of most
                | people.
        
             | gnfargbl wrote:
             | OK, but why don't you block Freenom domains entirely?
             | 
             | Apart from perhaps a couple of sites like gob.gq, there's
             | essentially nothing of any value on those TLDs. Allow-list
             | the handful of good sites, if you must, and default block
             | the rest.
        
               | ronnier wrote:
                | I could. But we are talking about one of the world's
                | largest social media platforms used by hundreds of
                | millions of people daily. There are legit websites hosted
               | on these free domains and I don't want to kill those
               | along with the scam sites. I've mostly got the scam sites
               | blocked at this point though. Just took me a week or so
               | to adapt.
        
               | gnfargbl wrote:
               | > There's legit websites hosted on these free domains
               | 
               | Are there though, really? Can you give some examples?
               | 
               | To a first approximation, I contend that essentially
               | everything on Freenom is bad. There are maybe a _handful_
               | of good sites (the one I listed, https://koulouba.ml/,
               | etc) but you can find those on Google in a few minutes
               | with some _site:_ searches.
               | 
               | I commend your efforts in blocking the scam sites, but
               | also honestly believe that it would be better for you,
               | your customers and the internet at large to default block
               | Freenom. Freenom sites are junk, wherever they are
               | hosted.
        
               | ronnier wrote:
               | Here's NSFW scam sites behind CF that use free TLDs. I
               | could post 10s of thousands of these.
               | 
               | * https://atragcara.ga
               | 
               | * https://donaga.tk
               | 
               | * https://snatemhatzemerbedc.tk
        
               | gnfargbl wrote:
               | Yep, I know. I monitor these as they appear in
               | Certificate Transparency logs and DNS PTR records.
               | 
               | Freenom TLDs are just junk. Save yourself the hassle and
               | default block :-).
        
               | ronnier wrote:
               | Seems these sites should be blocked on CF, at the root.
               | Not all the leaf nodes apps. It's pretty easy for me to
               | automate it at my company. Seems CF could?
        
         | sschueller wrote:
         | Same goes for DDoS attacks. I am not sure how they do it but we
         | get hit by CF IPs with synfloods etc.
        
           | gnfargbl wrote:
           | Anyone can set the source IP on their packets to be anything.
           | I can send you TCP SYNs which are apparently from Cloudflare.
           | 
           | There was a proposal (BCP38) which said that networks should
           | not allow outbound packets with source IPs which could not
           | originate from that network, but it didn't really get a lot
           | of traction -- mainly due to BGP multihoming, I think.
        
             | toast0 wrote:
             | BCP38 has gotten some traction, but it's not super
             | effective until all the major tier-1 ISPs enforce it
             | against their customers. But it's hard to pressure tier-1
             | ISPs; you can't drop connections with them, because they're
              | too useful; anyway, if you did, the traffic would just flow
              | through another tier-1 ISP, because it's not really
             | realistic for tier-1s to prefix filter peerings between
             | themselves. Anyway, the customer that's spoofing could be
              | spoofing sources their ISP legitimately handles, and
             | there's a lot of those.
             | 
             | Some tier-1s do follow BCP38 though, so one day maybe?
             | Still, there's plenty of abuse to be done without spoofing,
             | so while it would be an improvement, it wouldn't usher in
             | an era of no abuse.
        
           | slothsarecool wrote:
           | You do not get attacked from Cloudflare with TCP attacks.
            | Somebody is spoofing the IP header to make it seem like
           | Cloudflare is DDoSing you.
           | 
           | The only way for somebody to DDoS from Cloudflare would be
           | using workers, however, this isn't practical as workers have
           | a very limited IP Range.
        
             | patricklorio wrote:
             | I run a fairly popular service and have received DDoS
             | attacks from Cloudflare's IP range (~20gbps). I can confirm
             | they respond to SYN+ACK with an ACK to complete the TCP
             | handshake. Through some investigating it seems like a
             | botnet using Cloudflare WARP (their VPN service).
        
             | fncivivue7 wrote:
        
             | cmeacham98 wrote:
             | The reason people do this, by the way, is because it's
             | common if you're hosting via CF to whitelist their IPs and
             | block the rest. This allows their SYN flood to bypass that.
        
         | [deleted]
        
       | uvdn7 wrote:
       | This is a wonderful article. Thanks for sharing. As always,
       | Cloudflare blog posts do not disappoint.
       | 
       | It's very interesting that they are essentially treating IP
       | addresses as "data". Once looking at the problem from a
       | distributed system lens, the solution here can be mapped to
       | distributed systems almost perfectly.
       | 
       | - Replicating a piece of data on every host in the fleet is
       | expensive, but fast and reliable. The compromise is usually to
       | keep one replica in a region; same as how they share a single /32
       | IP address in a region.
       | 
       | - "sending datagram to IP X" is no different than "fetching data
       | X from a distributed system". This is essentially the underlying
       | philosophy of the soft-unicast. Just like data lives in a
       | distributed system/cloud, you no longer know where is an IP
       | address located.
       | 
       | It's ingenious.
       | 
       | They said they don't like stateful NAT, which is understandable.
       | But the load balancer has to be stateful still to perform the
       | routing correctly. It would be an interesting follow up blog post
       | talking about how they coordinate port/data movements (moving a
       | port from server A to server B), as it's state management (not
       | very different from moving data in a distributed system again).
        
         | remram wrote:
         | I have a lot of trouble mapping your comment to the content of
         | the article. It is about the _egress addresses_ , the ones
         | CloudFlare use as source when fetching from origin servers.
         | Those addresses need to be separated by the region of the end-
         | user ("eyeball"/browser) and the CloudFlare service they are
         | using (CDN or WARP).
         | 
         | The cost they are working around is the cost of IPv4 addresses,
         | versus the combinatorial explosion in their allocation scheme
         | (they need number of services * number of regions * whatever
         | dimension they add next, because IP addresses are nothing like
         | data).
         | 
         | I am not sure where you see data replication in this scheme?
        
           | uvdn7 wrote:
           | It's not meant to be a perfect analogy. The replication
           | analogy is mostly talking about the tradeoff between
           | performance and cost. So it's less about "replicating" the ip
           | addresses (which is not happening). On that front, maybe
           | distribution would be a better term. Instead of storing a
           | single piece of data on a single host (unicast), they are
           | distributing it to a set of hosts.
           | 
           | Overall, it seems like they are treating ip addresses as data
           | essentially, which becomes most obvious when they talk about
           | soft-unicast.
           | 
           | Anyway, I just found it interesting to look at this through
           | this lens.
        
             | majke wrote:
             | "Overall, it seems like they are treating ip addresses as
             | data essentially"
             | 
             | Spot on!
             | 
             | In past:
             | 
             | * /24 per datacenter (BGP), /32 per server (local network)
             | (all 64K ports)
             | 
             | New:
             | 
             | * /24 per continent (group of colos), /32 per colo, port-
             | slice per server
             | 
             | This is totally hierarchical. All we did is build a tech to
             | change the "assignment granularity". Now with this tech we
             | can do... anything we want. We're not tied to BGP, or IP's
             | belonging to servers, or adjacent IP's needing to be
             | nearby.
             | 
             | The cost is the memory cost of global topology. We don't
             | want a global shared-state NAT (each 2 or 4-tuple being
             | replicated globally on all servers). We don't want zero-
             | state (a machine knowing nothing about routing, just BGP
             | does the job). We want to select a reasonable mix. Right
             | now it's /32 per datacenter.... but we can change it if we
             | want and be more, or less specific than that.
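A toy version of the hierarchy described above, with a /24 per continent, a /32 per colo, and a port slice per server (all prefixes, colo names, and the slice width here are invented for illustration):

```python
PORTS_PER_SLICE = 2048  # illustrative slice width

# Hypothetical assignment table: continent -> colo -> (/32, servers)
EGRESS = {
    "eu": {  # gets a hypothetical /24, e.g. 203.0.113.0/24
        "ams": {"ip": "203.0.113.7", "servers": ["srv-1", "srv-2"]},
        "fra": {"ip": "203.0.113.8", "servers": ["srv-3"]},
    },
}

def egress_for(continent, colo, server):
    """Return the (IP, inclusive port range) a server egresses from."""
    site = EGRESS[continent][colo]
    i = site["servers"].index(server)
    lo = i * PORTS_PER_SLICE
    return site["ip"], (lo, lo + PORTS_PER_SLICE - 1)
```

Changing the "assignment granularity" then just means restructuring this table; nothing about the scheme is tied to BGP announcing anything smaller than the /24.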
        
               | 0xbkt wrote:
               | Is this feature fully rolled out yet? I still see unicast
               | IPs connecting from the entry colo and traceroute
               | confirms that.
        
               | ludikalell2 wrote:
               | Only downside seems more stress on quicksilver ;)
        
       | superkuh wrote:
       | Yikes. More cloudflare breakage of the internet model. Pretty
       | soon we might as well all just live within cloudflare's WAN
       | entirely.
        
         | eastdakota wrote:
          | ¯\_(ツ)_/¯
         | 
         | Another perspective is that the connection of an IP to specific
         | content or individuals was a bug of the Internet's original
         | design and thankfully we're finally finding ways to
         | disassociate them.
        
         | AlphaSite wrote:
         | The internets a set of abstractions, as long as they still
         | implement some common protocols and don't create a walled
         | garden, is there any real social or technical issue with them
         | doing unusual things in their network?
         | 
         | I can totally see an argument against their CDN being too
         | pervasive and problematic for TOR users, but this seems fine
         | IMO.
        
         | wrs wrote:
         | What's breaking the internet model is the internet becoming too
         | popular and running out of addresses. There's nothing specific
         | to Cloudflare here. You're free to do the same thing to
         | conserve your own address space. It's sort of a super-fancy
         | NAT.
        
         | majke wrote:
         | Author here, I know this is a dismissive comment, but I'll bite
         | anyway.
         | 
         | As far as I understand the history of the IP protocol,
         | initially an IP address pointed to a host. (/etc/hosts file
         | seems that way)
         | 
         | Then it was realized a single entity might have multiple
         | network interfaces, and an IP started to point to a network
         | card on a host. (a host can have many IP's). Then all the VRF,
         | dummy devices, tuntaps, VETH and containers. I guess an IP is
         | now pointing to a container or VM. But there is more. For
          | performance you can (almost should!) have a unique IP address
         | per NUMA node. Or even logical CPU.
         | 
          | On the modern internet, a server IP points to a single CPU in
          | a container in a VM on a host.
         | 
         | Then consider Anycast, like 1.1.1.1 or 8.8.8.8. An IP means
         | something else... it means a resource.
         | 
          | On the "client" side we have customer NAT's, CG NAT's and
          | VPN's. An IP means similarly little.
         | 
         | The IP's are really expensive, so in some cases there is a
         | strong advantage to save them. Take a look at
         | https://blog.cloudflare.com/addressing-agility/
         | 
         | "So, test we did. From a /20 address set, to a /24 and then,
         | from June 2021, to an address set of one /32, and equivalently
         | a /128 (Ao1). It doesn't just work. It really works"
         | 
         | We're able to serve "all cloudflare" from /32.
         | 
         | There is this whole trend of getting denser and denser IP
         | usage. It's not avoidable. It's not "breaking the Internet" in
         | any way more than "NAT's are breaking the Internet". The
         | network evolves, because it has to. And for one, I don't think
         | this is inherently bad.
        
           | superkuh wrote:
           | >It's not avoidable. It's not "breaking the Internet" in any
           | way more than "NAT's are breaking the Internet".
           | 
            | I agree. NATs, particularly the carrier NAT that smartphone
            | users are behind, have broken the internet. They've made it so
           | most people do not have ports and cannot participate in the
           | internet. So now software developers cannot write software
           | that uses the internet (without depending on third parties).
           | This is bad. So is what you've done.
           | 
           | Someday ipv6 will save us.
        
             | [deleted]
        
           | ignoramous wrote:
           | > _And for one, I don 't think this is inherently bad._
           | 
           | Ao1 has super nice censorship resistance properties. With DoH
           | + ECH, that essentially is game over for most firewalls.
           | Can't wait to see just how Cloudflare rolls Ao1 out (I'd
           | imagine it'd be opt-in, like _Green Compute_ ).
           | 
           | > _Then it was realized a single entity might have multiple
           | network interfaces, and an IP started to point to a network
           | card on a host_
           | 
           | Much to Nagle's chagrin:
           | https://news.ycombinator.com/item?id=21088736 (a
           | _Berkeleyism_ as he calls it)
        
       | BlueTemplar wrote:
       | Related :
       | 
       | Deploying IPv6-mostly access networks :
       | https://news.ycombinator.com/item?id=33694293
       | 
       | (I can't for the life of me find the comment chain where this
       | link was posted ~4 days ago ??)
       | 
       | RFC 8925 - IPv6-Only Preferred Option for DHCPv4 (2020) :
       | https://news.ycombinator.com/item?id=33697978 (posted on HN by
       | me)
        
       | remram wrote:
       | TLDR:
       | 
       | > To avoid geofencing issues, we need to choose specific egress
       | addresses tagged with an appropriate country, depending on WARP
       | user location. (...) Instead of having one or two egress IP
       | addresses for each server, now we require dozens, and IPv4
       | addresses aren't cheap.
       | 
       | > Instead of assigning one /32 IPv4 address for each server, we
       | devised a method of assigning a /32 IP per data center, and then
       | sharing it among physical servers (...) splitting an egress IP
       | across servers by a port range.
        
         | majke wrote:
         | Ha, I guess this is one way of summarizing it :) Author here. I
         | wanted to share more subtleties of the design, but maybe I
         | failed.
         | 
         | Indeed, the starting point is sharing IP's across servers with
         | port-ranges.
         | 
         | But there is more:
         | 
         | * awesome performance allowed by anycast.
         | 
         | * ability to route /32 instead of /24 per datacenter.
         | 
         | Generally, with this tech we can have _much_ better IP usage
         | density, without sacrificing reliability or performance. You
         | can call it  "global anycast-based stateless NAT" but that
         | often implies some magic router configuration, which we don't
         | have.
         | 
          | Here's one example of a problem we run into: the lack of a
          | connectx() syscall on Linux makes it hard to actually select
          | the port range to originate connections from:
         | 
         | https://blog.cloudflare.com/how-to-stop-running-out-of-ephem...
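Without a connectx()-style syscall, one userspace workaround on Linux is to bind() an explicit source port inside the assigned slice before connect(). A rough sketch (the slice range is invented, and this ignores the subtler races the linked post discusses):

```python
import random
import socket

PORT_SLICE = range(40000, 42048)  # hypothetical 2048-port slice

def connect_from_slice(src_ip, dst_addr):
    """Open a TCP connection whose source port falls in our slice."""
    for _ in range(32):  # retry on collision with an in-use port
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((src_ip, random.choice(PORT_SLICE)))
            s.connect(dst_addr)
            return s
        except OSError:
            s.close()
    raise OSError("no free source port in slice")
```

The retry loop is the ugly part: bind() claims the port before connect() knows the 4-tuple, which is exactly the awkwardness the linked blog post is about.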
        
           | chatmasta wrote:
           | I was surprised IPv6 was only briefly mentioned! Is that
           | something you're looking at next, or are you already running
           | an IPv6 egress network?
           | 
           | Of course not every destination is an IPv6 host, so IPv4
           | remains necessary, but at least IPv6 can avoid the need for
           | port slicing, since you can encode the same bucketing
           | information in the IP address itself.
           | 
           | I've seen this idea used as a cool trick [0] to implement a
           | SOCKS proxy that randomizes outbound IPv6 address to be
           | within a publicly routed prefix for the host (commonly a
           | /64).
           | 
           | I guess as long as you need to support IPv4, then port
           | slicing is a requirement and IPv6 won't confer much benefit.
           | (Maybe it could help alleviate port exhaustion if IPv6
           | addresses can use dynamic ports from any slice?)
           | 
           | Either way, thanks for the blog post, I enjoyed it!
           | 
           | [0] https://github.com/blacklanternsecurity/TREVORproxy
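Encoding the bucket in the address itself is straightforward once you have a routed /64; a minimal sketch using a documentation prefix (the prefix and the idea of a per-server identifier here are illustrative, not Cloudflare's scheme):

```python
import ipaddress

PREFIX = ipaddress.IPv6Network("2001:db8::/64")  # hypothetical routed prefix

def egress_v6(server_id):
    """Embed a server/bucket identifier in the low interface bits,
    instead of carving up a scarce 16-bit port space."""
    return ipaddress.IPv6Address(int(PREFIX.network_address) + server_id)
```

With 64 interface bits to play with, the return path can identify the server from the destination address alone, so no port slicing is needed on the v6 side.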
        
             | miyuru wrote:
             | I was also interested to know how this was handled for
             | IPv6, but it was only briefly mentioned.
             | 
             | Probably they didn't need to do much work with IPv6, since
             | half of the post is solving IPv4 exhaustion problems.
        
               | chriscappuccio wrote:
               | Cloudflare wants to make money. The IPv6 features can
               | come second as v6 usage increases.
        
               | dknecht wrote:
                | All of Cloudflare's services ship with IPv6 on day 1.
                | IPv6 is not an issue, as we have enough IPv6 addresses
                | for each machine to have its own IPs.
        
           | pencilcode wrote:
            | Is the geofencing country-level only? So if, using Warp,
            | I look up nearby restaurants on TripAdvisor, will it have
            | no idea what city I'm in? Guessing that's not so, but
            | wondering how it works.
        
             | aeyes wrote:
             | This blog post has some info:
             | https://blog.cloudflare.com/geoexit-improving-warp-user-
             | expe...
             | 
             | Warp uses its own set of egress IPs and their geolocation
             | is close to your real location.
        
           | remram wrote:
           | From your article it seemed that your use of anycast was more
           | accident than feature, due to the limit of BGP prefix sizes.
            | If you could route those IPs to their correct destination
            | you would; you only go to the closest data center and
            | route again because you have no choice.
           | 
           | Maybe this ends up reducing cost on customers though, because
           | the international transit happens in your backbone network
           | rather than on the internet (customer-side).
        
         | elp wrote:
          | In English: they now do carrier-grade NAT.
        
           | dilyevsky wrote:
           | right? "We don't like distributed NAT" couple paragraphs
           | later - "hey check out this awesome dist NAT implementation".
        
           | cm2187 wrote:
           | well, vanilla NAT really.
        
             | [deleted]
        
       | immibis wrote:
       | This is a horrible way to avoid upgrading the world to IPv6.
        
         | xnyan wrote:
          | The industry will not transition to v6 unless: 1) The cost
          | of not doing so is higher than the cost of sticking with
          | v4. Because of the numerous clever tricks and products
          | designed to mitigate v4's limitations, the cost argument
          | still favors v4 for most people in most situations.
         | 
         | or
         | 
         | 2) We admit that v6 needs to be rethought and rethink it. I
         | understand why v6 does not just increase IP address bits from
         | 32 to 128, but at this point I think everyone has admitted that
         | v6 is simply too difficult for most IT departments to
         | implement. In particular, the complexity of the new assignment
          | schemes like prefix delegation and SLAAC needs to be pared
          | back. Offer a minimum set of features and spin off everything
         | else.
        
           | __turbobrew__ wrote:
            | I think we would have been better off if IPv6 had just
            | lengthened the address fields and left it at that.
        
           | anecdotal1 wrote:
            | That, and I have PTSD from IPv6 routing to random places
            | being broken, which nobody notices because the fallback to
            | v4 still works...
        
           | tempnow987 wrote:
           | Spot on with part 2. Ignore all the folks saying IPv6 is
           | "simple" or works well.
           | 
            | I am absolutely no expert, but I could get my head around
            | IPv4; with IPv6 I always end up in a fuss. I really wish
            | they'd expanded the address space to 64 bits, made a few
            | other tweaks, and called it good. Maybe call it IPv5? Is
            | there any chance of doing something like that?
           | 
           | So many things that are so trivial or well known in IPv4 are
           | a total nightmare pain with IPv6. Some quick examples:
           | 
            | Internet service providers will happily give you a block
            | of static IPv4 addresses for a price. AT&T goes up to a
            | 64-address block easily, even on residential. It is almost
            | impossible to get a static block of IPv6 in the US.
           | 
           | Let's say you are SMB, you want WAN failover. With IPv4 this
           | is simple. You can either get two blocks of static for your
           | upstream, and route them directly as appropriate to your
            | servers, or go behind a NAT and do a failover option. When
            | the failover happens, your internal network is relatively
            | unaffected.
           | 
            | Now try to do this with IPv6: you can't get static IPs to
            | route directly, NPT and everything else is a mess, and
            | having your entire network renumber whenever the WAN side
            | flaps is stupid and annoying.
           | 
            | In many SMB contexts folks are very used to DHCP: they use
            | it to trigger TFTP boot scripts, zero-touch phone
            | provisioning, handing out time servers, and lots more. The
            | set of end-user devices (printers, phones, security
            | cameras, intercoms, industrial IoT) that can be configured
            | and supported with IPv6 is poor, and the complexity is
            | high.
           | 
            | Not all ISPs offer prefix delegation to end-user sites.
            | And because IPv6 has an insane minimum subnet size, a lot
            | of things that need just two IPs (think a separate network
            | for customer-premise equipment) now need
            | 18,446,744,073,709,551,616 (2^64) addresses.
           | 
            | The GSE debacle means that instead of a very large 64-bit
            | address space we got an insane 128-bit address space.
            | Seriously, how about 96, or anything a bit more
            | reasonable?
           | 
            | Even things like ICMPv6: if you just let it through the
            | firewall you could be asking for trouble, but blocking it
            | also breaks IPv6. Ugh. Oh, but it's simpler than IPv4,
            | they say.
        
           | ignoramous wrote:
            | > _We admit that v6 needs to be rethought and rethink
            | it... Offer a minimum set of features and spin off
            | everything else._
           | 
           | NAT is _that_ IPv6.
        
           | zajio1am wrote:
            | > In particular, the complexity of the new assignment
            | schemes like prefix delegation and SLAAC needs to be pared
            | back.
           | 
           | Prefix delegation to customers is necessary if you want to
            | avoid NAT. IPv4 stayed with delegating just one IP address
            | per customer (household) because everyone used NAT in home
            | routers for IP address conservation.
           | 
           | SLAAC was introduced because people wanted a simpler
           | alternative to more complex DHCP. That is why it is
           | mandatory, while DHCP is optional.
           | 
           | > but at this point I think everyone has admitted that v6 is
           | simply too difficult for most IT departments to implement.
           | 
            | I do not think that IPv6 is too difficult to implement; I
            | have rarely heard this argument. The main reason there is
            | no transition is that there is no economic or political
            | incentive for individual organizations to do so, so there
            | is a coordination problem.
        
           | immibis wrote:
           | Both prefix delegation and SLAAC are optional. Use DHCPv6 if
           | you like.
        
             | tempnow987 wrote:
             | Some quick questions.
             | 
              | SLAAC seems to introduce insane churn in the IPv6
              | addresses of end-user devices, and SLAAC (vs DHCPv6)
              | seems to struggle to fully configure an end-user device
              | (think DNS servers etc.).
              | 
              | As someone who beat my head on the IPv6 thing for a
              | while before giving up (I tried to go all-IPv6): what is
              | the "proper" way to set up a home IPv6 network with
              | SLAAC and not DHCPv6?
             | 
              | If DHCPv6 is essentially always required anyway, why not
              | let it subsume SLAAC for IP address assignment?
        
               | simoncion wrote:
               | > SLAAC seems to introduce insane churn in the IPv6 of
               | end user devices.
               | 
               | By "insane churn" do you mean "devices generate and
               | allocate new IP addresses for themselves periodically
               | (maybe daily, maybe more frequently)"? If you do, then
               | that's not SLAAC, that's the head-assed thing sometimes
               | known as "IPv6 Privacy Addresses". From what I've seen on
               | Windows, OSX, and Linux, this makes it so that there's
               | one IP that remains constant, and a parade of addresses
               | that get assigned as time marches on. You can disable it
               | on Windows, OSX, and Linux, and I would recommend doing
               | so.
               | 
                | > SLAAC (vs DHCPv6) seems to struggle in fully
                | configuring an end user device (think DNS servers
                | etc.)
               | Yeah, if you're interested in only using SLAAC, then the
               | best you can do is set the `RDNSS` option [0] in your
               | Router Advertisements and pray that the network
               | configurator in the OS you're using has bothered to pay
               | attention to it.
               | 
               | [0] <https://www.rfc-editor.org/rfc/rfc8106#section-5.1>
               | (Do note that despite the date on this RFC, this option
               | was first specified in 2007, and first specified in a
               | non-experimental RFC in 2010... so, it's not like it's
               | new.)
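                | 
                | As a concrete example, with radvd (a common Linux RA
                | daemon) RDNSS is only a few lines of config. This is a
                | sketch with made-up interface and addresses, not a
                | drop-in file:

```
interface eth0 {
    AdvSendAdvert on;
    prefix 2001:db8:1::/64 {
        AdvOnLink on;
        AdvAutonomous on;
    };
    # Advertise a recursive DNS server via the RFC 8106 RDNSS option
    RDNSS 2001:db8:1::53 {
        AdvRDNSSLifetime 300;
    };
};
```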
        
               | immibis wrote:
               | Any sane home router should make it Just Work. All the
               | router has to do is advertise which address prefix your
               | ISP gave you, and your device just chooses an address
               | under that prefix. It couldn't really be simpler - even
               | DHCP is more complex.
               | 
               | However, it only handles your address. Apparently there's
               | an extension to make it also provide DNS addresses and so
               | on. If you don't have that, then I guess you configure
               | DNS manually. Or use 8.8.8.8.
               | 
               | It's not like your router is doing some magic DNS auto-
               | discovery, by the way. It just tells devices the
               | addresses that someone typed into its own configuration
               | page.
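                | 
                | The address-formation step really is that mechanical.
                | Here's a sketch of the classic modified-EUI-64 scheme
                | (modern OSes usually derive stable-privacy or
                | temporary interface IDs instead, but the prefix+IID
                | split is the same):

```python
import ipaddress

def eui64_interface_id(mac: str) -> int:
    """Modified EUI-64: split the MAC, insert ff:fe, flip the U/L bit."""
    b = bytes(int(x, 16) for x in mac.split(":"))
    eui = bytes([b[0] ^ 0x02]) + b[1:3] + b"\xff\xfe" + b[3:6]
    return int.from_bytes(eui, "big")

def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine the advertised /64 prefix with the interface identifier."""
    net = ipaddress.IPv6Network(prefix)
    assert net.prefixlen == 64, "SLAAC expects a /64 prefix"
    return ipaddress.IPv6Address(int(net.network_address)
                                 | eui64_interface_id(mac))

print(slaac_address("2001:db8:1:2::/64", "00:11:22:33:44:55"))
# -> 2001:db8:1:2:211:22ff:fe33:4455
```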
        
               | greyface- wrote:
               | > what is the "proper" way to setup a home IPv6 network
               | with SLAAC and not DHCPv6?
               | 
               | RFC 8106 IPv6 Router Advertisement Options for DNS
               | Configuration
               | 
               | https://www.rfc-editor.org/rfc/rfc8106
        
             | xnyan wrote:
              | If they are not needed for core functionality, why are
              | they part of the standard? It's not just IT departments
              | failing to grok v6 that's the problem; IPv6
              | implementations are a
             | total crapshoot in terms of quality. Some are great, a
             | small few are terrible or straight up don't work at all,
             | and way too many perform perfectly in common situations but
             | blow up in an uncommon edge case. A significant number of
             | devices (especially in industrial, healthcare and
             | scientific fields) still don't support it at all. This both
             | is caused by and perpetuates lack of adoption on the user
             | side.
             | 
             | Reduction in scope and complexity makes it easier to
             | implement high quality IPv6 stacks. We also need more high-
             | quality reference implementations available for newer and
             | underserved platforms, but that's a different subject.
        
         | Animats wrote:
         | I'm surprised that Cloudflare isn't all IPv6 when Cloudflare is
         | the client. That would solve their address problems. Maybe
         | charge more if your servers can't talk IPv6. Or require it for
         | the free tier.
         | 
          | It's useful that they use client-side certificates. (They
          | call this "authenticated origin pull", but it seems to be
          | client-side certs.)
        
           | ec109685 wrote:
            | They also have to egress to third-party servers, since
            | they are a CDN and support things like serverless
            | functions.
        
           | amluto wrote:
            | Sadly, Cloudflare seems to treat client certificates as an
            | optional nifty feature as opposed to the critical feature
            | they are. And even some of the settings that look secure
            | aren't:
           | 
           | https://medium.com/@ss23/leveraging-cloudflares-
           | authenticate...
           | 
           | Authenticated origin pulls should not be "useful". They
           | should be on and configured securely by default, and any
           | insecure setting should get a loud warning.
        
         | LinkLink wrote:
          | It is easier to convince a group of geniuses of a grand
          | idea than to convince an average person to change one thing.
        
       ___________________________________________________________________
       (page generated 2022-11-26 23:02 UTC)