[HN Gopher] Is it time to replace TCP in data centers?
       ___________________________________________________________________
        
       Is it time to replace TCP in data centers?
        
       Author : zdw
       Score  : 39 points
        Date   : 2023-02-20 19:19 UTC (1 day ago)
        
 (HTM) web link (blog.ipspace.net)
 (TXT) w3m dump (blog.ipspace.net)
        
       | manv1 wrote:
       | "Is it time to spend hundreds of billions of dollars on new
       | networking equipment for marginal gains? Pundits say yes."
        
         | wmf wrote:
         | Homa uses existing equipment.
        
         | dokem wrote:
          | This is our punishment for violating the OSI model. Why should
          | network gear need to know about anything above the IP layer?
        
           | manv1 wrote:
           | The OSI model was standardized in 1984. The first TCP RFC
           | came out in 1980, with versions existing in real life before
           | that.
           | 
           | The OSI model isn't, and has never been, the be-all end-all
           | of networking models, and it's bizarre to me that people
           | continue to believe that the OSI is the only valid model to
           | follow.
           | 
           | Maybe you can explain why you believe that the OSI model is
           | canonical?
        
             | dokem wrote:
              | What other model(s) are you referring to? TCP/IP? That
              | still has the same IP/transport separation, so can you
              | explain why it's a good idea to so blatantly violate
              | fundamental layers of abstraction?
        
       | omani wrote:
       | Betteridge's law of headlines.
        
       | WesolyKubeczek wrote:
        | There have been a few would-be TCP killers out there; SCTP and
        | DCCP come to mind. And yet, here we are.
       | 
       | I even remember that there have been Linux kernel security
        | vulnerabilities in those protocols, somehow exploitable even when
       | you don't actively use them, so... I don't compile my own kernels
       | often these days, but when I do, I usually say N to any and all
       | protocols I'm not likely to be using in my remaining life.
        
       | likeabbas wrote:
       | Doesn't HTTP3 solve this with QUIC?
        
         | shanemhansen wrote:
          | QUIC could be a candidate; RPC over UDP/IP is definitely the
          | direction proposed by Ousterhout. He has listed a few specific
          | TCP properties as undesirable, and frankly I'm not sure how
          | QUIC stacks up on them:
          | 
          | - Bandwidth sharing ("fair" scheduling)
          | - Sender-driven congestion control
         | 
         | See: https://arxiv.org/pdf/2210.00714.pdf (pdf download)
        
         | KaiserPro wrote:
          | No, HTTP3 is not a message-passing protocol. It's a proto
          | remote file access protocol with a rich data channel, designed
          | by people who really didn't want it to be a file access
          | protocol.
        
         | exabrial wrote:
         | It does until you try to tcpdump :/
         | 
         | 1) It's not TCP
         | 
          | 2) It's a PITA on a 'trusted' network because there is no
          | "encryption off" mode
        
           | olodus wrote:
            | Those are very solvable problems. I work with non-TCP traffic
            | in pcaps very often, and while I agree encrypted traffic is
            | sometimes annoying to deal with because of the extra step, it
            | can be solved with good tooling.
        
           | zamadatix wrote:
            | tcpdump works fine with it, and with any other Ethernet- or
            | IP-based protocol despite the command's name; it's just 2)
            | that's a problem (for certain perspectives of what counts as
            | a problem).
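            | 
            | For what it's worth, capturing it is just a plain UDP filter.
            | A minimal sketch with scapy (assuming QUIC on UDP/443 and an
            | interface named eth0; tcpdump takes the same BPF expression):
            | 
            |     # Treats QUIC as what it is on the wire: UDP datagrams.
            |     from scapy.all import sniff
            | 
            |     # same filter works as: tcpdump -i eth0 'udp port 443'
            |     pkts = sniff(filter="udp port 443", iface="eth0", count=100)
            |     pkts.summary()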
        
           | erik_seaberg wrote:
           | As long as governments attack networks with impunity, it's a
           | mistake to trust them.
        
             | shawnz wrote:
             | See this well-known slide from the PRISM leaks as a
             | reminder of this: https://upload.wikimedia.org/wikipedia/co
             | mmons/f/f2/NSA_Musc...
        
             | KaiserPro wrote:
              | It's a question of choosing when and how you encrypt.
              | 
              | For internal datacentre traffic, it depends on the value of
              | the data. But you will need to do application-level
              | encryption.
              | 
              | For links between datacentres, encrypting links based on
              | rules is a sensible and not horrifically challenging thing
              | to do. If you're big enough to have to worry about the
              | volume of encrypted traffic, then you have rules-based
              | bandwidth priorities.
              | 
              | For virtually everyone else, just use WireGuard or some
              | other VPN with encryption.
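              | 
              | A minimal site-to-site sketch of that last option (all
              | keys, addresses, and names here are made up):
              | 
              |     # /etc/wireguard/wg0.conf on datacentre A
              |     [Interface]
              |     PrivateKey = <site-A private key>
              |     Address = 10.255.0.1/30
              |     ListenPort = 51820
              | 
              |     [Peer]
              |     # datacentre B's gateway; route its internal range here
              |     PublicKey = <site-B public key>
              |     Endpoint = dc-b.example.net:51820
              |     AllowedIPs = 10.255.0.2/32, 172.16.0.0/16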
        
       | shanemhansen wrote:
       | This seems like a review of John Ousterhout's work w/ Homa. I
       | highly recommend reading the original.
       | https://arxiv.org/abs/2210.00714
       | 
        | For those who don't know, Ousterhout created the first log-
        | structured filesystem and created Tcl (used heavily in hardware
        | verification, but also forming the backbone of some of the
        | first large web servers: AOLserver). I was actually surprised to
        | find out he co-founded a company with the current CTO of
        | Cloudflare. https://en.wikipedia.org/wiki/John_Ousterhout
       | 
       | He has both a candidate replacement as well as benchmarks showing
       | 40% performance improvements with grpc on homa compared to grpc
       | on TCP. https://github.com/PlatformLab/grpc_homa
       | 
        | With that in mind, I think nobody will replace TCP, and I doubt
        | anything not IP-compatible will be able to get off the ground.
        | His argument is essentially that TCP is a bad choice for low-
        | latency RPC.
       | 
       | We've already seen people build a number of similar systems on
       | UDP including HTTP replacements that have delivered value for
       | clients doing lots of parallel requests on the WAN.
       | 
        | I think many big tech companies are essentially already choosing
        | to bypass TCP. I recall Facebook doing a lot of work with
        | memcache over UDP. I can't find any public docs on whether or not
        | Google's internal RPC uses TCP.
       | 
       | I wouldn't be surprised at all if in the near future something
       | like grpc/capnproto/twirp/etc had an in-datacenter TCP-free fast
       | path. It would be cool if it was built on Homa.
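        | 
        | To make the "message, not byte stream" idea concrete, here is a
        | deliberately bare-bones request/response RPC over UDP. It is not
        | Homa and not how gRPC would actually do it; the wire format and
        | names are made up:
        | 
        |     import json, socket, uuid
        | 
        |     def call(addr, method, params, timeout=0.2, retries=3):
        |         """One datagram per request; reliability is on us now."""
        |         req_id = str(uuid.uuid4())
        |         msg = json.dumps({"id": req_id, "method": method,
        |                           "params": params}).encode()
        |         with socket.socket(socket.AF_INET,
        |                            socket.SOCK_DGRAM) as s:
        |             s.settimeout(timeout)
        |             for _ in range(retries):
        |                 s.sendto(msg, addr)
        |                 try:
        |                     data, _ = s.recvfrom(65535)
        |                 except socket.timeout:
        |                     continue  # lost request or reply: retry
        |                 reply = json.loads(data)
        |                 if reply.get("id") == req_id:
        |                     return reply["result"]
        |         raise TimeoutError(f"no reply for {method}")
        | 
        | No connection setup, no ordering across requests, and congestion
        | control left entirely to the caller: roughly the set of concerns
        | Homa argues belong in the transport itself.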
        
         | wpietri wrote:
         | Looking at the Ousterhout paper, I share your skepticism. I'm
         | perfectly willing to believe that a parallel universe with
         | datacenters running Homa would be a better world.
         | 
         | But if I were to make a list of actual reasons too much money
         | is getting burned on AWS bills, I suspect TCP inefficiencies
         | wouldn't make the top 10. And worse, some of the things that do
         | make the list, like bad managers, rushed schedules, under-
         | trained developers, changing technical fashions, and
         | "enterprise" development culture, all are huge barriers to
         | replacing something so deep in the stack.
        
         | ignoramous wrote:
         | > _With that in mind I think nobody will replace TCP_
         | 
          |  _Within_ the data center? AWS uses _SRD_ with custom-built
          | network cards [0]. I'd be surprised if Microsoft, Facebook,
          | and Google aren't doing something similar.
         | 
         | [0] https://ieeexplore.ieee.org/document/9167399 /
         | https://archive.is/qZGdC
        
         | opportune wrote:
         | The problem is not just TCP vs not-TCP. It's using TCP/HTTP to
         | transmit data in JSON, vs picking a stack that is more
         | optimized for service-to-service communication in a datacenter.
         | 
         | I am willing to wager that most organizations' microservices
         | spend the majority of their CPU usage doing serialization and
         | deserialization of JSON.
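          | 
          | That claim is easy to test on your own payloads. A rough sketch
          | with only the standard library (the payload shape is invented):
          | 
          |     import json, timeit
          | 
          |     payload = {"items": [{"id": i, "name": f"item-{i}",
          |                           "tags": ["a", "b"], "price": i * 0.99}
          |                          for i in range(1000)]}
          | 
          |     blob = json.dumps(payload)
          |     enc = timeit.timeit(lambda: json.dumps(payload), number=1000)
          |     dec = timeit.timeit(lambda: json.loads(blob), number=1000)
          |     print(f"encode {enc:.2f}s, decode {dec:.2f}s per 1000 calls")
          | 
          | Comparing those numbers against the handler's actual business
          | logic is usually what makes the case for a binary, schema-based
          | format inside the datacenter.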
        
           | jiggawatts wrote:
            | Inefficient RPC compounds the common error of "SELECT * FROM
            | BigTable". I regularly see multi-hundred-megabyte result sets
            | bloat out to a gigabyte on the wire.
            | 
            | Bizarrely, this takes just a couple of seconds to transfer
            | over 10 GbE, so many devs simply don't notice or chalk it up
            | to "needing more capacity".
           | 
           | Yes, yes, it's the stingy sysops hoarding the precious
           | compute that's to blame...
        
             | hinkley wrote:
             | I know people who've tried to fix this from time to time
             | but it always seems to go wrong.
             | 
             | We could track expected response size, but then every
             | feature launch triggers a bunch of alerts which either
             | causes the expenditure of social capital, or results in
             | alert fatigue which causes us to miss real problems, or
             | both.
             | 
              | This is a place where telemetry does particularly well. I
              | don't need to be vigilant about regressions every CD cycle,
              | every day, or even every week. Most times, if I catch a
              | problem within a couple of weeks and can trace it back to
              | the source, that's sufficient to keep the wheels on.
        
           | teknopaul wrote:
            | N.B. And are quite happy doing so, because it makes app
            | development a breeze.
            | 
            | Being able to use tools like tcpdump to debug applications is
            | important for fast problem resolution.
            | 
            | Unless everything you do is "web scale" and development costs
            | are insignificant, simple paradigms like stream-oriented text
            | protocols will have their place.
        
         | tyingq wrote:
         | Ousterhout was also one of the co-authors of the Raft consensus
         | paper.
        
         | KaiserPro wrote:
         | > We've already seen people build a number of similar systems
         | on UDP including HTTP replacements that have delivered value
         | for clients doing lots of parallel requests on the WAN.
         | 
          | HTTP isn't low latency. Spending all that effort porting
          | everything to a discrete message-based network, only to have
          | HTTP semantics running over the top, is a massive own goal.
         | 
          | As for WAN, that's a whole different kettle of fish. You need
          | to deal with huge latencies, whole-integer percentages of
          | packet loss, and lots of other weird shit.
         | 
          | In that instance you need to create a protocol tailored to
          | your needs. Are you after bandwidth efficiency, or raw speed?
          | Or do you need to minimise latency? All three things require
          | completely different layouts and tradeoffs.
         | 
          | I used to work for a company that shipped TBs from NZ & Aus to
          | LA via London. We used to use Aspera, but that's expensive. So
          | we made a tool that used hundreds of TCP connections to max out
          | our network link.
         | 
          | For specialist applications, I can see a need for something
          | like Homa. But for 95% of datacenter traffic, it's just not
          | worth the effort.
         | 
         | That paper is also fundamentally flawed, as the OP has rightly
         | pointed out. Just because Ousterhout is clever, doesn't make
         | him right.
        
         | killingtime74 wrote:
         | He also wrote one of my favorite software books
         | https://web.stanford.edu/~ouster/cgi-bin/book.php.
         | 
         | Presentation https://youtu.be/bmSAYlu0NcY
        
       | [deleted]
        
       | Animats wrote:
       | There are many ways to improve on TCP for specific use cases. You
       | can get maybe another 10%-20% of performance, as Google did with
       | QUIC. Used outside that use case, performance may be worse. The
       | design goal of TCP was that it should Just Work. It needed to be
       | interoperable between different vendors, provide reasonably good
        | performance, work over a wide range of network speeds, work over
       | possibly flaky networks, continue to perform under overload, and
       | not have manual tuning parameters. It does well against those
       | criteria.
       | 
        | This new protocol is the sort of thing Ousterhout is known for.
        | He came up with log-structured file systems, which are optimized
        | for the case where disk writes outnumber reads. Reading then
        | requires more seeks on mechanical drives. If you're mostly
        | writing logs that are seldom read, it's a win. If you're doing
        | more reading than writing, it's a loss.
        
       | nickdothutton wrote:
       | I see we're doing this again. I last worked on whole-DC designs
       | around 1999 with Sun, Intel, and of course... Mellanox since we
       | were to use RDMA/InfiniBand. The plan was to embed InfiniBand
       | semiconductors on disks, RAM (individual sticks or trays of
       | sticks) and CPUs. There was to be an orchestration/provisioning
       | layer allowing you to reserve a composable machine (CPU(s), RAM,
       | storage, other IO, etc) at will, and some kind of supervisory
       | layer was to ensure that these things were as local as possible
        | to one another in any given rack, aisle, or floor.
        
         | Traubenfuchs wrote:
          | So you were pretty much working on a more hardware-based
          | predecessor of Kubernetes?
         | 
         | Why did it not work out?
        
           | al2o3cr wrote:
           | Some big problems that jump out:
           | 
           | * Infiniband hardware didn't get cheaper as fast as RAM +
           | disk
           | 
           | * Infiniband hardware didn't get faster as quickly as
           | locally-attached RAM + disk
        
           | SgtBastard wrote:
           | It wasn't written in Go </snark>
           | 
           | I'd also love to hear more about the concept.
        
           | bobleeswagger wrote:
           | > Why did it not work out?
           | 
           | Software won, it seems.
        
             | KaiserPro wrote:
              | Kinda; it was cheaper to use Fibre Channel/Ethernet and
              | plain old SPARC/Intel machines.
              | 
              | This was about the time that grid engines promised
              | mainframe computing without the cost. Only that it required
              | people to understand how to parallelise their workloads.
              | That, and modify programmes to run on many machines at
              | once.
        
           | KaiserPro wrote:
           | > Why did it not work out?
           | 
           | 1) cost
           | 
           | 2) software support
           | 
           | 3) cost
        
       | jmclnx wrote:
        | I say no, and I do not see how this could ever happen. Just look
        | at IPv4 to IPv6 for an example of how this would go :)
        
         | Dylan16807 wrote:
         | Anything with packets would still be based on normal IP so you
         | wouldn't have those issues.
         | 
          | In the worst case, you just build on top of UDP and don't care
          | that you're wasting eight bytes of header.
        
       | paxys wrote:
       | Unless the alternative is a 100% backwards compatible drop-in
       | replacement for TCP, even starting to discuss it is pointless.
       | And if it is compatible at that level, it will just bring with it
       | all the pitfalls that we want to replace in the first place.
       | 
       | The TCP/IP stack works because I don't need to care about what
       | environment the two processes that need to communicate are
       | running in. They could both be on my local machine, or in my home
       | network, or communicating over the internet, or some random
       | intranet, in a data center, across continents, on any OS or any
       | kind of device...it simply does not matter. "Just do a ground-up
       | rewrite of your entire software stack and you'll get a guaranteed
        | 5% efficiency gain" isn't the bulletproof argument that people
       | who come up with these alternatives seem to think.
        
         | mike_d wrote:
         | Suggested alternative title: "Is it time to replace oxygen in
         | the atmosphere?"
        
         | foota wrote:
          | This is only really true of external connections; for orgs with
          | a large degree of control over their internal stack, this can
          | make sense.
        
         | ithkuil wrote:
          | I wonder if we need to replace TCP in the data centers, or if
          | what data centers really need is RPC (which is covered well
          | by QUIC).
        
           | StillBored wrote:
            | Right, I've been saying this for a while: it seems that many
            | of the hyperscalers/etc. made the mistake of trying to use
            | TCP (and maybe HTTP/etc.) for their (generic) RPC mechanisms.
            | Which is hardly the best plan, when what is actually needed
            | is just a reliable datagram protocol without a bunch of
            | connection state.
            | 
            | So, while the original RPC protocol (Sun's ONC RPC) has its
            | own issues, the one thing it got right was the ability for
            | the portmapper to indicate UDP vs. TCP as the transport on a
            | per-service basis. There have been a few improved generic RPC
            | mechanisms, and I don't really understand why some of these
            | places feel the need to "replace TCP" when really what they
            | need is a more formalized RPC mechanism that can set/detect
            | datagram reliability and pick varying levels of protocol
            | retry/etc. as needed.
        
         | readingnews wrote:
         | Correct. I agree. Well said.
        
         | zokier wrote:
          | Is QUIC a 100% backwards-compatible drop-in replacement for
          | TCP? No. Is it pointless to discuss? Also no.
          | 
          | And DC applications are far easier to switch over than the
          | general internet: fewer middleboxes to screw you over,
          | generally better-performing networks, and more tightly
          | controlled hosts.
         | 
         | Honestly, it wouldn't be all that farfetched to have AWS
         | implement QUIC over SRD for squeezing the last perf drops out.
        
         | nick0garvey wrote:
         | I don't agree. A subset of high performance applications in a
         | datacenter can use a new protocol while still supporting TCP
         | for other applications. I'm not saying this is easy, or even
         | worth it, but it isn't all or nothing.
        
           | teknopaul wrote:
            | I think the point is: a protocol aimed at "a subset of high
            | performance applications" is not going to _replace_ TCP
            | unless it works OK for everyone else.
            | 
            | An additional message-oriented protocol is cool, but replace
            | TCP with it?
        
           | bjackman wrote:
            | I think the parent commenter would be pretty surprised how
            | little code would really need a rewrite. Google and Amazon do
           | not have hundreds of thousands of engineers messing around in
           | the mud with sockets and connections and IP addresses and
           | DNS. There's a service mesh and a standardised RPC framework.
           | You say "get me the X service" and you make RPCs to it.
           | Whether they're transported over HTTP, UDP, unix domain
           | sockets, or local function calls is fully abstracted.
        
             | hinkley wrote:
             | If you need to go that fast why not implement a layer 2
             | protocol?
             | 
             | The point of these abstractions is that they are insurance.
             | We pay taxes on best case scenarios all the time in order
             | to avoid or clamp worst case scenarios. When industries
             | start chasing that last 5% by gambling on removing
             | resiliency, that usually ends poorly for the rest of us.
             | See also train lines in the US.
        
             | dietr1ch wrote:
             | Do they need to replace a lot of code?
             | 
             | I'd suspect that the tcp bits are hidden in the rpc layer
             | anyway, be it grpc or whatever
        
         | kevin_thibedeau wrote:
         | IPPROTO_SCTP does this today. Just have to convince
         | middleware/box vendors to support internet protocol.
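          | 
          | A minimal sketch of what using it looks like from userspace,
          | with only the Python standard library (assumes a Linux kernel
          | with SCTP support loaded; the peer address is made up):
          | 
          |     import socket
          | 
          |     # One-to-one style SCTP socket; looks just like TCP code.
          |     s = socket.socket(socket.AF_INET, socket.SOCK_STREAM,
          |                       socket.IPPROTO_SCTP)
          |     s.connect(("10.0.0.2", 7777))
          |     s.send(b"hello over SCTP")
          |     print(s.recv(4096))
          |     s.close()
          | 
          | The API friction is low; as the replies note, the real blockers
          | are OS support outside Linux and middleboxes/NATs that only
          | understand TCP and UDP.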
        
           | jeroenhd wrote:
           | You'll also need to convince several operating system
           | maintainers to build/optimise their implementation first.
           | 
           | I wouldn't have any trouble ignoring middlebox software (or
           | adding a TCP fallback with a big warning that something
           | suspicious is interfering with the connection) but Windows
           | and macOS still lack proper SCTP support, Linux' SCTP support
           | has some performance issues and usermode raw sockets will
           | probably need to bypass several OS sandboxes to be viable.
           | 
           | That said, in server to server connections SCTP can probably
           | be used just fine.
        
           | hinkley wrote:
           | Where this conversation might have some points would be that
           | you don't typically have these middle boxes within a single
           | data center, so you could _try_ to ignore them.
           | 
           | Until you remember that you have datacenters that need to
           | talk to each other and then this strategy doesn't work.
        
           | CountSessine wrote:
           | ...and Microsoft and Apple if your endpoints include Windows,
            | iOS or macOS. All of those have 3rd party drivers but I don't
           | think any of them come with SCTP support out of the box.
        
             | noselasd wrote:
              | Even if they did, it would be a no-go at the moment: the
              | number of endpoints sitting behind a NAT box that only
              | knows about UDP and TCP is huge.
        
       | LinuxBender wrote:
       | _Is It Time to Replace TCP in Data Centers?_
       | 
        | Probably not? Maybe? One would have to either integrate support
        | for the new protocol into every server, switch, router, and IoT
        | device, oh and of course all the 3rd-party clouds and clients one
        | is speaking to, _OR_ everything leaving the datacenter would have
        | to be dual-stack and/or funneled through some WAN
        | optimizer/gateway/proxy device that can translate the new
        | protocol into TCP/IP, creating a single-point-of-success
        | bottleneck. Dual-stack brings up some security issues that this
        | protocol will have to address, not to mention more cabling
        | complexity.
       | 
        | I think the best place to start this conversation would be with
        | architects at Microsoft, IBM/Red Hat, maybe even Meta since they
        | acquired several kernel developers, and let them see how cost-
        | effective the gains are. If the big players buy into this, they
        | have kernel developers that can integrate seamless support into
        | Linux and Windows to start with, and if a few big players try it
        | out, then maybe it would share the same market cap as InfiniBand.
        | Let them deploy a proof-of-concept pod for free for a year and
        | see what they can do with it. I think this would have to be
        | successful first before other vendors start adding support for
        | new protocols. If the plan is to have one vendor to rule them
        | all, then it will not succeed, as there would be no competition
        | and it would be too expensive for mass adoption. This would end
        | up being another proprietary thing that IBM or some other big
        | company acquires and sits on.
       | 
       | At least that is my opinion based on my experience deploying SAN
        | switches, proprietary memory interconnect buses, proprietary
       | storage and storage clustering, proprietary mini-mainframes and
       | server clusters. Speed improvements will impress technical people
       | but businesses ultimately look into TCO/ROI and reliability.
       | Complexity is a factor in reliability and the ability to hire
       | people to support said new thing. If anything I have seen
       | datacenters going the opposite direction; that is, keeping things
       | as generic as possible and using open source solutions to scale
       | first to their vertical limits and then horizontally. Ceph is a
       | great example of this.
        
         | blibble wrote:
         | you can run a new protocol on top of IP "relatively" easily: it
         | routes fine across the internet as switches and routers don't
         | care about TCP[1] (switches don't even care about IP)
         | 
          | Linux will even let you do it from userspace via raw sockets
          | (with the CAP_NET_RAW capability)
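          | 
          | a sketch of that, with the standard library (protocol number
          | 253 is the IANA "experimentation" value from RFC 3692; the
          | destination address is made up):
          | 
          |     import socket
          | 
          |     # raw IP socket: we supply the payload, the kernel adds an
          |     # IP header with our protocol number; needs CAP_NET_RAW
          |     s = socket.socket(socket.AF_INET, socket.SOCK_RAW, 253)
          |     s.sendto(b"my-not-tcp-message", ("192.0.2.10", 0))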
         | 
          | there are some exceptions: like if your endpoint is some crappy
          | cloud provider (e.g. Azure) that provides you something that
          | looks like an Ethernet network but really is a glorified
          | TCP/unicast UDP proxy
         | 
         | [1]: ignoring NAT and IGMP/MLD snooping (not that anyone does
         | those outside of internal networks... right?)
        
           | xxpor wrote:
            | Routers care about TCP (or UDP) because of flow hashing.
            | Write a new IP protocol and see what happens when you try to
            | push >100 Gbit.
        
             | blibble wrote:
              | you still have 3 out of 5 parts of the standard tuple
              | (source IP, destination IP, protocol number)
              | 
              | not perfect but not unworkable either
        
       | olodus wrote:
        | Since one of the suggestions the article brings up is QUIC, won't
        | it happen almost automatically? At least the distributed systems
        | I work with usually have HTTP between their parts (a few with
        | some other RPC solution). Then when the switch to HTTP/3 happens,
        | QUIC will come with it. There will probably be some problems
        | along the way, sure, and it will take time for people to tune
        | things, but that must be expected, right?
        
       ___________________________________________________________________
       (page generated 2023-02-21 23:01 UTC)