[HN Gopher] Is it time to replace TCP in data centers?
___________________________________________________________________
Is it time to replace TCP in data centers?
Author : zdw
Score : 39 points
Date : 2023-02-20 19:19 UTC (1 day ago)
(HTM) web link (blog.ipspace.net)
(TXT) w3m dump (blog.ipspace.net)
| manv1 wrote:
| "Is it time to spend hundreds of billions of dollars on new
| networking equipment for marginal gains? Pundits say yes."
| wmf wrote:
| Homa uses existing equipment.
| dokem wrote:
| This is our punishment for violating the OSI model. Why should
| network gear need to know about anything above the IP layer?
| manv1 wrote:
| The OSI model was standardized in 1984. The first TCP RFC
| came out in 1980, with versions existing in real life before
| that.
|
| The OSI model isn't, and has never been, the be-all end-all
| of networking models, and it's bizarre to me that people
| continue to believe that the OSI is the only valid model to
| follow.
|
| Maybe you can explain why you believe that the OSI model is
| canonical?
| dokem wrote:
| What other model(s) are you referring to? TCP/IP? That still
| has the same IP/transport separation, so can you explain why
| it's a good idea to so blatantly violate fundamental layers of
| abstraction?
| omani wrote:
| Betteridge's law of headlines.
| WesolyKubeczek wrote:
| There have been a few TCP would-be killers out there; SCTP and
| DCCP come to mind. And yet, here we are.
|
| I even remember that there have been Linux kernel security
| vulnerabilities in those protocols, somehow exploitable even when
| you don't actively use them, so... I don't compile my own kernels
| often these days, but when I do, I usually say N to any and all
| protocols I'm not likely to be using in my remaining life.
| likeabbas wrote:
| Doesn't HTTP3 solve this with QUIC?
| shanemhansen wrote:
| QUIC could be a candidate; RPC over UDP/IP is definitely the
| direction proposed by Ousterhout. He has listed a few specific
| TCP properties as undesirable, and frankly I'm not sure how QUIC
| stacks up on them:
|
| - Bandwidth sharing ("fair" scheduling)
| - Sender-driven congestion control
|
| See: https://arxiv.org/pdf/2210.00714.pdf (pdf download)
| KaiserPro wrote:
| No, HTTP3 is not a message-passing protocol. It's a proto remote
| file access protocol with a rich data channel designed by
| people who really didn't want it to be a file access protocol.
| exabrial wrote:
| It does until you try to tcpdump :/
|
| 1) It's not TCP
|
| 2) It's a PITA on a 'trusted' network because there is no
| "encryption off" mode
| olodus wrote:
| Those are very solvable problems. I work with non-TCP traffic
| in pcaps very often, and while I agree encrypted traffic is
| sometimes annoying to deal with because of the extra step, it
| can be solved with good tooling.
| zamadatix wrote:
| tcpdump works fine with it, and with any other Ethernet- or
| IP-based protocol despite the command's name; it's just 2)
| that's a problem (for certain perspectives of what counts as a
| problem).
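|
| For what it's worth, the capture filter is the same whether you
| use tcpdump or something scriptable. A minimal sketch in Python,
| assuming the third-party scapy package and enough privileges to
| sniff:
|
|   # QUIC is just UDP on port 443 on the wire; filter it like any
|   # other IP-based protocol (tcpdump equivalent: "udp and port 443").
|   from scapy.all import sniff
|
|   packets = sniff(filter="udp and port 443", count=5)
|   for pkt in packets:
|       print(pkt.summary())  # decrypting QUIC payloads is the hard part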
| erik_seaberg wrote:
| As long as governments attack networks with impunity, it's a
| mistake to trust them.
| shawnz wrote:
| See this well-known slide from the PRISM leaks as a
| reminder of this:
| https://upload.wikimedia.org/wikipedia/commons/f/f2/NSA_Musc...
| KaiserPro wrote:
| It's a question of choosing when and how you encrypt.
|
| For internal datacentre traffic, it depends on the value of
| the data. But you will need to do application level
| encryption.
|
| For links between datacentres, encrypting links based on
| rules is a sensible and not horrifically challenging thing
| to do. If you're big enough to have to worry about the
| volume of encrypted traffic, then you have rules based
| bandwidth priorities.
|
| for virtually everyone else, just use wireguard or some
| other VPN with encryption.
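|
| For the application-level option, a minimal sketch of the idea
| (assuming the third-party "cryptography" package; key
| distribution and rotation are the actual hard parts and are out
| of scope here):
|
|   from cryptography.fernet import Fernet
|
|   key = Fernet.generate_key()  # in practice, fetched from a secrets store
|   box = Fernet(key)
|
|   # Encrypt the RPC payload itself, so the transport underneath
|   # (TCP, UDP, or anything else) doesn't have to be trusted.
|   ciphertext = box.encrypt(b'{"rpc": "get_user", "id": 42}')
|   plaintext = box.decrypt(ciphertext)  # raises InvalidToken if tampered with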
| shanemhansen wrote:
| This seems like a review of John Ousterhout's work w/ Homa. I
| highly recommend reading the original.
| https://arxiv.org/abs/2210.00714
|
| For those who don't know, Ousterhout created the first log-
| structured filesystem and created Tcl (used heavily in hardware
| verification, but also forming the backbone of some of the first
| large web servers: AOLserver). I was actually surprised to
| find out he co-founded a company with the current CTO of
| Cloudflare. https://en.wikipedia.org/wiki/John_Ousterhout
|
| He has both a candidate replacement and benchmarks showing 40%
| performance improvements with gRPC on Homa compared to gRPC on
| TCP. https://github.com/PlatformLab/grpc_homa
|
| With that in mind I think nobody will replace TCP and I doubt
| anything not IP compatible will be able to get off the ground.
| His argument is essentially that for low latency RPC protocols
| TCP is a bad choice.
|
| We've already seen people build a number of similar systems on
| UDP including HTTP replacements that have delivered value for
| clients doing lots of parallel requests on the WAN.
|
| I think many big tech companies are essentially already choosing
| to bypass TCP. I recall Facebook doing a lot of work with
| memcache over UDP. I can't find any public docs on whether or
| not Google's internal RPC uses TCP.
|
| I wouldn't be surprised at all if in the near future something
| like grpc/capnproto/twirp/etc had an in-datacenter TCP-free fast
| path. It would be cool if it was built on Homa.
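|
| To make the "TCP-free fast path" concrete, here is a minimal
| Python sketch of a request/response exchange over plain UDP.
| This is not Homa and not any shipping RPC framework; the address,
| message format, and retry policy are invented for illustration.
|
|   import json
|   import socket
|
|   ADDR = ("127.0.0.1", 9999)  # hypothetical in-datacenter service
|
|   def serve_once():
|       # Answer a single request; a real service would loop forever.
|       with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as srv:
|           srv.bind(ADDR)
|           data, client = srv.recvfrom(65535)
|           request = json.loads(data)
|           reply = {"id": request["id"], "result": request["x"] * 2}
|           srv.sendto(json.dumps(reply).encode(), client)
|
|   def call(payload, retries=3, timeout=0.2):
|       # Reliability (timeouts, retransmits) lives in the application,
|       # not in a kernel TCP state machine.
|       with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as cli:
|           cli.settimeout(timeout)
|           for _ in range(retries):
|               cli.sendto(json.dumps(payload).encode(), ADDR)
|               try:
|                   data, _ = cli.recvfrom(65535)
|                   return json.loads(data)
|               except socket.timeout:
|                   continue
|           raise TimeoutError("no reply from service")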
| wpietri wrote:
| Looking at the Ousterhout paper, I share your skepticism. I'm
| perfectly willing to believe that a parallel universe with
| datacenters running Homa would be a better world.
|
| But if I were to make a list of actual reasons too much money
| is getting burned on AWS bills, I suspect TCP inefficiencies
| wouldn't make the top 10. And worse, some of the things that do
| make the list, like bad managers, rushed schedules, under-
| trained developers, changing technical fashions, and
| "enterprise" development culture, all are huge barriers to
| replacing something so deep in the stack.
| ignoramous wrote:
| > _With that in mind I think nobody will replace TCP_
|
| _Within_ the data center? AWS uses _SRD_ with custom-built
| network cards [0]. I'd be surprised if Microsoft, Facebook,
| and Google aren't doing something similar.
|
| [0] https://ieeexplore.ieee.org/document/9167399 /
| https://archive.is/qZGdC
| opportune wrote:
| The problem is not just TCP vs not-TCP. It's using TCP/HTTP to
| transmit data in JSON, vs picking a stack that is more
| optimized for service-to-service communication in a datacenter.
|
| I am willing to wager that most organizations' microservices
| spend the majority of their CPU usage doing serialization and
| deserialization of JSON.
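|
| As a rough illustration of that wager, a small Python sketch
| comparing JSON text encoding with a fixed binary layout for the
| same record (the schema and field names are invented; real
| services would reach for protobuf, Cap'n Proto, etc.):
|
|   import json
|   import struct
|   import timeit
|
|   record = {"user_id": 12345678, "score": 98.6, "flags": 7}
|
|   def encode_json():
|       return json.dumps(record).encode()
|
|   def encode_struct():
|       # Fixed schema: u64 id, double score, u32 flags = 20 bytes.
|       return struct.pack("<QdI", record["user_id"], record["score"],
|                          record["flags"])
|
|   n = 100_000
|   print("json  :", timeit.timeit(encode_json, number=n))
|   print("struct:", timeit.timeit(encode_struct, number=n))
|   print("bytes :", len(encode_json()), "vs", len(encode_struct()))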
| jiggawatts wrote:
| Inefficient RPC compounds the common error of "SELECT * FROM
| BigTable". I regularly see multi-hundred-megabyte result sets
| bloat out to a gigabyte on the wire.
|
| Bizarrely, this takes just a couple of seconds to transfer
| over 10 GbE, so many devs simply don't notice, or they chalk it
| up to "needing more capacity".
|
| Yes, yes, it's the stingy sysops hoarding the precious
| compute that's to blame...
| hinkley wrote:
| I know people who've tried to fix this from time to time
| but it always seems to go wrong.
|
| We could track expected response size, but then every
| feature launch triggers a bunch of alerts which either
| causes the expenditure of social capital, or results in
| alert fatigue which causes us to miss real problems, or
| both.
|
| This is a place where telemetry does particularly well. I
| don't need to be vigilant to regressions every CD cycle,
| every day, or even every week. Most times if I catch a
| problem in a couple of weeks, and I can trace it back to
| the source, that's sufficient to keep the wheels on.
| teknopaul wrote:
| N.B. And are quite happy doing so because it makes app
| development a breeze.
|
| Being able to use tools like tcpdump to debug applications is
| important for fast problem resolution.
|
| Unless everything you do is "Web scale" and development costs
| are insignificant, simple paradigms like stream-oriented text
| protocols will have their place.
| tyingq wrote:
| Ousterhout was also one of the co-authors of the Raft consensus
| paper.
| KaiserPro wrote:
| > We've already seen people build a number of similar systems
| on UDP including HTTP replacements that have delivered value
| for clients doing lots of parallel requests on the WAN.
|
| HTTP isn't low latency. Spending all that effort porting
| everything to a discrete message-based network, only to have
| HTTP semantics running over the top, is a massive own goal.
|
| As for WAN, that's a whole different kettle of fish. You need to
| deal with huge latencies, whole-integer percentages of packet
| loss, and lots of other weird shit.
|
| In that instance you need to create a protocol tailored to your
| needs. Are you after bandwidth efficiency, or raw speed? Or do
| you need to minimise latency? All three require completely
| different layouts and tradeoffs.
|
| I used to work for a company that shipped TBs from NZ & Aus to
| LA via London. We used to use Aspera, but that's expensive. So
| we made a tool that used hundreds of TCP connections to max out
| our network link.
|
| For specialist applications, I can see a need for something
| like Homa. But for 95% of datacenter traffic, it's just not
| worth the effort.
|
| That paper is also fundamentally flawed, as the OP has rightly
| pointed out. Just because Ousterhout is clever doesn't make him
| right.
| killingtime74 wrote:
| He also wrote one of my favorite software books
| https://web.stanford.edu/~ouster/cgi-bin/book.php.
|
| Presentation https://youtu.be/bmSAYlu0NcY
| [deleted]
| Animats wrote:
| There are many ways to improve on TCP for specific use cases. You
| can get maybe another 10%-20% of performance, as Google did with
| QUIC. Used outside that use case, performance may be worse. The
| design goal of TCP was that it should Just Work. It needed to be
| interoperable between different vendors, provide reasonably good
| performance, work over a wide range of networks speeds, work over
| possibly flaky networks, continue to perform under overload, and
| not have manual tuning parameters. It does well against those
| criteria.
|
| This new protocol is the sort of thing Ousterhout is known for.
| He came up with log-structured file systems, which are optimized
| for the case where disk writes outnumber reads. Reading then
| requires more seeks on mechanical drives. If you're mostly
| writing logs that are seldom read, it's a win. If you're doing
| more reading than writing, it's a lose.
| nickdothutton wrote:
| I see we're doing this again. I last worked on whole-DC designs
| around 1999 with Sun, Intel, and of course... Mellanox since we
| were to use RDMA/InfiniBand. The plan was to embed InfiniBand
| semiconductors on disks, RAM (individual sticks or trays of
| sticks) and CPUs. There was to be an orchestration/provisioning
| layer allowing you to reserve a composable machine (CPU(s), RAM,
| storage, other IO, etc) at will, and some kind of supervisory
| layer was to ensure that these things were as local as possible
| to one another in any given rack, aisle, or floor.
| Traubenfuchs wrote:
| So you were pretty much working on a more hardware-based
| predecessor of Kubernetes?
|
| Why did it not work out?
| al2o3cr wrote:
| Some big problems that jump out:
|
| * Infiniband hardware didn't get cheaper as fast as RAM +
| disk
|
| * Infiniband hardware didn't get faster as quickly as
| locally-attached RAM + disk
| SgtBastard wrote:
| It wasn't written in Go </snark>
|
| I'd also love to hear more about the concept.
| bobleeswagger wrote:
| > Why did it not work out?
|
| Software won, it seems.
| KaiserPro wrote:
| Kinda. It was cheaper to use Fibre Channel/Ethernet and
| plain old SPARC/Intel machines.
|
| This was about the time that grid engines promised
| mainframe computing without the cost. Only it required
| people to understand how to parallelise their workloads,
| and to modify programmes to run on many machines at once.
| KaiserPro wrote:
| > Why did it not work out?
|
| 1) cost
|
| 2) software support
|
| 3) cost
| jmclnx wrote:
| I say no, and I do not see how this could ever happen. Just look
| at IPv4 to IPv6 for an example of how this would go :)
| Dylan16807 wrote:
| Anything with packets would still be based on normal IP so you
| wouldn't have those issues.
|
| In the worst case, you just build on top of UDP and don't care
| that you're wasting four bytes.
| paxys wrote:
| Unless the alternative is a 100% backwards compatible drop-in
| replacement for TCP, even starting to discuss it is pointless.
| And if it is compatible at that level, it will just bring with it
| all the pitfalls that we want to replace in the first place.
|
| The TCP/IP stack works because I don't need to care about what
| environment the two processes that need to communicate are
| running in. They could both be on my local machine, or in my home
| network, or communicating over the internet, or some random
| intranet, in a data center, across continents, on any OS or any
| kind of device... it simply does not matter. "Just do a ground-up
| rewrite of your entire software stack and you'll get a guaranteed
| 5% efficiency gain" isn't the bulletproof argument that people who
| come up with these alternatives seem to think it is.
| mike_d wrote:
| Suggested alternative title: "Is it time to replace oxygen in
| the atmosphere?"
| foota wrote:
| This is only really true of external connections; for orgs with
| a large degree of control over their internal stack, this can
| make sense.
| ithkuil wrote:
| I wonder if we need to replace TCP in the data centers or if
| what data centers really need is RPC (which is covered well
| by QUIC).
| StillBored wrote:
| Right, I've been saying this for a while: it seems that many
| of the hyperscalers/etc. made the mistake of trying to use
| TCP (and maybe HTTP/etc.) for their (generic) RPC mechanisms,
| which is hardly the best plan when what is actually needed
| is just a reliable datagram protocol without a bunch of
| connection state.
|
| So, while the actual RPC protocol (Sun RPC) has its own issues,
| the one thing it got right was the ability for the portmapper
| to indicate UDP vs TCP as the transport on a per-service basis.
| There have been a few improved generic RPC mechanisms since,
| and I don't really understand why some of these places feel the
| need to "replace TCP" when really what they need is a more
| formalized RPC mechanism that can set/detect the datagram
| reliability and pick varying levels of protocol retry/etc. as
| needed.
| readingnews wrote:
| Correct. I agree. Well said.
| zokier wrote:
| Is QUIC a 100% backwards-compatible drop-in replacement for TCP?
| No. Is it pointless to discuss? Also no.
|
| And DC applications are far easier to switch over than the
| general internet: fewer middleboxes to screw you over, generally
| better-performing networks, and more tightly controlled hosts.
|
| Honestly, it wouldn't be all that far-fetched for AWS to
| implement QUIC over SRD to squeeze out the last drops of perf.
| nick0garvey wrote:
| I don't agree. A subset of high performance applications in a
| datacenter can use a new protocol while still supporting TCP
| for other applications. I'm not saying this is easy, or even
| worth it, but it isn't all or nothing.
| teknopaul wrote:
| I think the point is: a protocol that meets the goals of "a
| subset of high performance applications" is not going to
| _replace_ TCP unless it works OK for everyone else.
|
| An additional message-oriented protocol is cool, but replace TCP
| with it?
| bjackman wrote:
| I think the parent commenter would be pretty surprised how
| much code would really need a rewrite. Google and Amazon do
| not have hundreds of thousands of engineers messing around in
| the mud with sockets and connections and IP addresses and
| DNS. There's a service mesh and a standardised RPC framework.
| You say "get me the X service" and you make RPCs to it.
| Whether they're transported over HTTP, UDP, unix domain
| sockets, or local function calls is fully abstracted.
| hinkley wrote:
| If you need to go that fast why not implement a layer 2
| protocol?
|
| The point of these abstractions is that they are insurance.
| We pay taxes on best case scenarios all the time in order
| to avoid or clamp worst case scenarios. When industries
| start chasing that last 5% by gambling on removing
| resiliency, that usually ends poorly for the rest of us.
| See also train lines in the US.
| dietr1ch wrote:
| Do they need to replace a lot of code?
|
| I'd suspect that the TCP bits are hidden in the RPC layer
| anyway, be it gRPC or whatever.
| kevin_thibedeau wrote:
| IPPROTO_SCTP does this today. Just have to convince
| middleware/box vendors to support internet protocol.
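|
| A minimal sketch of what that looks like from userspace, assuming
| Linux with the kernel sctp module available (Python standard
| library only, no third-party SCTP bindings):
|
|   import socket
|
|   # One-to-one style SCTP association: a TCP-like API over a
|   # message-oriented, multi-streaming transport.
|   srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM,
|                       socket.IPPROTO_SCTP)
|   srv.bind(("0.0.0.0", 5000))
|   srv.listen(1)
|   # accept()/recv()/send() then work much as they do with TCP;
|   # per-stream options need setsockopt calls that the plain
|   # stdlib does not wrap conveniently.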
| jeroenhd wrote:
| You'll also need to convince several operating system
| maintainers to build/optimise their implementation first.
|
| I wouldn't have any trouble ignoring middlebox software (or
| adding a TCP fallback with a big warning that something
| suspicious is interfering with the connection) but Windows
| and macOS still lack proper SCTP support, Linux' SCTP support
| has some performance issues and usermode raw sockets will
| probably need to bypass several OS sandboxes to be viable.
|
| That said, in server to server connections SCTP can probably
| be used just fine.
| hinkley wrote:
| Where this conversation might have a point is that you don't
| typically have these middleboxes within a single data center,
| so you could _try_ to ignore them.
|
| Until you remember that you have datacenters that need to
| talk to each other and then this strategy doesn't work.
| CountSessine wrote:
| ...and Microsoft and Apple if your endpoints include Windows,
| iOS, or macOS. All of those have 3rd-party drivers, but I don't
| think any of them come with SCTP support out of the box.
| noselasd wrote:
| Even if they did, it's a no-go at the moment: the number
| of endpoints behind a NAT box that only knows about
| UDP and TCP is huge.
| LinuxBender wrote:
| _Is It Time to Replace TCP in Data Centers?_
|
| Probably not? Maybe? One would have to either integrate support
| for the new protocol into every server, switch, router, and IoT
| device, oh and of course all the 3rd-party clouds and clients
| one is speaking to, _OR_ everything leaving the datacenter would
| have to be dual-stack and/or funneled through some WAN
| optimizer/gateway/proxy device that can translate the new
| protocol into TCP/IP, creating a single-point-of-success
| bottleneck. Dual-stack brings up some security issues that this
| protocol would have to address, not to mention more cabling
| complexity.
|
| I think the best place to start this conversation would be with
| architects at Microsoft, IBM/Red Hat, and maybe even Meta, since
| they have hired several kernel developers, and let them see how
| cost-effective the gains are. If the big players buy into this,
| have kernel developers who can integrate seamless support into
| Linux and Windows to start with, and a few of them try it out,
| then maybe it would share the same market cap as InfiniBand.
| Let them deploy a proof-of-concept pod for free for a
| year and see what they can do with it. I think this would have to
| be successful first before other vendors start adding support for
| new protocols. If the plan is to have one vendor to rule them all
| then it will not succeed as there would be no competition and it
| would be too expensive for mass adoption. This would end up being
| another proprietary thing that IBM or some other big company
| acquires and sits on.
|
| At least that is my opinion based on my experience deploying SAN
| switches, proprietary memory inter-connect buses, proprietary
| storage and storage clustering, proprietary mini-mainframes and
| server clusters. Speed improvements will impress technical people,
| but businesses ultimately look at TCO/ROI and reliability.
| Complexity is a factor in both reliability and the ability to hire
| people to support said new thing. If anything, I have seen
| datacenters going in the opposite direction; that is, keeping things
| as generic as possible and using open source solutions to scale
| first to their vertical limits and then horizontally. Ceph is a
| great example of this.
| blibble wrote:
| you can run a new protocol on top of IP "relatively" easily: it
| routes fine across the internet as switches and routers don't
| care about TCP[1] (switches don't even care about IP)
|
| linux will even let you do it from userspace (with net_admin
| cap)
|
| there are some exceptions: like if your endpoint is a
| crappy cloud provider (e.g. azure) that provides you something
| that looks like an ethernet network but really is a glorified
| TCP/unicast UDP proxy
|
| [1]: ignoring NAT and IGMP/MLD snooping (not that anyone does
| those outside of internal networks... right?)
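|
| A minimal sketch of that userspace approach, assuming root (or
| CAP_NET_RAW) and using protocol number 253, which RFC 3692 sets
| aside for experimentation:
|
|   import socket
|
|   PROTO_EXPERIMENTAL = 253
|
|   s = socket.socket(socket.AF_INET, socket.SOCK_RAW, PROTO_EXPERIMENTAL)
|   s.sendto(b"hello, not-TCP", ("192.0.2.10", 0))  # TEST-NET-1 example
|   # The kernel builds the IP header on send; received datagrams on a
|   # raw IPv4 socket include the IP header, which the receiver must
|   # parse past.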
| xxpor wrote:
| routers care about tcp (or udp) because of flow hashing.
| write a new IP proto and see what happens when you try to
| push >100 gbit
| blibble wrote:
| you have 3 out of 5 parts of the standard tuple
|
| not perfect but not unworkable either
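|
| A toy illustration of the flow-hashing point (made-up hash
| function and link count; real switch ASICs differ): with a known
| transport, the full 5-tuple spreads flows across uplinks, while
| an unknown protocol collapses everything between two hosts onto
| one link.
|
|   import zlib
|
|   LINKS = 8  # hypothetical number of equal-cost uplinks
|
|   def pick_link(fields):
|       key = ",".join(str(f) for f in fields).encode()
|       return zlib.crc32(key) % LINKS
|
|   # TCP: src IP, dst IP, protocol, src port, dst port (5-tuple).
|   print(pick_link(("10.0.0.1", "10.0.0.2", 6, 40001, 443)))
|   print(pick_link(("10.0.0.1", "10.0.0.2", 6, 40002, 443)))
|
|   # Unknown IP protocol: only the 3-tuple is visible, so every
|   # flow between the same pair of hosts lands on the same uplink.
|   print(pick_link(("10.0.0.1", "10.0.0.2", 253)))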
| olodus wrote:
| Since one of the suggestions the article brings up is QUIC,
| won't it happen almost automatically? At least the distributed
| systems I work with usually have HTTP between their parts (a
| few with some other RPC solution). Then when the switch to
| HTTP/3 happens, QUIC will come with it. There will probably be
| some problems along the way, sure, and it will take time for
| people to tune things, but that must be expected, right?
___________________________________________________________________
(page generated 2023-02-21 23:01 UTC)