[HN Gopher] Speed, scale and reliability: 25 years of Google dat...
       ___________________________________________________________________
        
       Speed, scale and reliability: 25 years of Google datacenter
       networking evolution
        
       Author : sandwichsphinx
       Score  : 236 points
       Date   : 2024-11-03 04:29 UTC (18 hours ago)
        
 (HTM) web link (cloud.google.com)
 (TXT) w3m dump (cloud.google.com)
        
       | alex_young wrote:
       | Like most discussions of the last 25 years, this one starts 9
       | years ago. Good times.
        
         | eru wrote:
         | The Further Resources section goes a bit further back.
        
       | jerzmacow wrote:
        | Wow, and it doesn't open with a picture of their Lego server?
        | Wasn't that their first one, 25 years ago?
        
         | teractiveodular wrote:
         | It's a marketing piece, they don't particularly want to
         | emphasize the hacky early days for an audience of Serious
         | Enterprise Customers.
        
       | DeathArrow wrote:
        | It seems all cutting-edge datacenters like x.ai's Colossus are
        | using Nvidia networking. Now Google is upgrading to Nvidia
        | networking, too.
        | 
        | Since Nvidia owns most of the GPGPU market and has top-notch
        | networking and interconnect, I wonder whether they have a plan
        | to own all datacenter hardware in the future. Maybe they plan
        | to also release CPUs, motherboards, storage, and whatever else
        | is needed.
        
         | Kab1r wrote:
          | Grace Hopper already includes Arm-based CPUs (and reference
          | motherboards).
        
         | timenova wrote:
         | That was their plan with trying to buy ARM...
        
         | danpalmer wrote:
         | I read this slightly differently, that specific machine types
         | with Nvidia GPU hardware also have Nvidia networking for tying
         | together those GPUs.
         | 
          | Google has its own TPUs and doesn't really use GPUs except to
          | sell them to end customers on Cloud, I think. So using Nvidia
          | networking for Nvidia GPUs across many machines on Cloud is
          | really just a reflection of what external customers want to
          | buy.
         | 
          | Disclaimer: I work at Google but have no non-public info
          | about this.
        
           | dmacedo wrote:
           | Having just worked with some of the Thread folks at M&S,
           | thought I'd reach out and say hello. Seems like it was an
           | awesome team! (=
        
             | danpalmer wrote:
             | You're lucky to be working with them, an amazing team.
        
         | adrian_b wrote:
         | Nvidia networking is what used to be called Mellanox
         | networking, which was already dominant in datacenters.
        
           | immibis wrote:
            | Only within supercomputers (including the smaller GPU ones
            | used to train AI). Normal data centers use Cisco or Juniper
            | or similarly well-known Ethernet equipment, and they still
            | do. The Mellanox/Nvidia InfiniBand networks are specifically
            | used for supercomputer-like clusters.
        
             | wbl wrote:
              | Mellanox Ethernet NICs got used in a bunch of places due
              | to better programmability.
        
             | dafugg wrote:
             | You seem to have a narrow definition of "normal" for
              | datacenters. Meta were using OCP Mellanox NICs for common
              | hardware platforms a decade ago and still are.
        
         | HDThoreaun wrote:
          | I have to wonder if Nvidia has reached a point where it
          | hesitates to develop new products because they would hurt
          | its margins. Sure, it could probably release a profitable
          | networking product, but if it did, its net margin would
          | decrease even as profit increased. This may actually hurt
          | its market cap, as investors absolutely love high margins.
        
           | eru wrote:
           | They can always release capital back to investors, and then
           | those investors can put the money into different companies
           | that eg produce networking equipment.
        
             | thrw42A8N wrote:
             | Why would they release money if they can invest it and
             | return much more?
        
         | mikeyouse wrote:
          | Yeah, there's a bit of industry worry about that very
          | eventuality -- hence the Ultra Ethernet Consortium trying to
          | work on open alternatives to the Mellanox/Nvidia lock-in.
         | 
         | https://ultraethernet.org/
        
           | ravetcofx wrote:
            | Interesting that Nvidia is on the steering committee.
        
         | jonas21 wrote:
         | I believe this is what they plan on doing, see:
         | 
         | https://www.youtube.com/live/Y2F8yisiS6E?si=GbyzzIG8w-mtS7s-...
        
       | cletus wrote:
        | This mentions Jupiter generations; Jupiter itself is, I think,
        | about 10-15 years old at this point. It doesn't really talk
        | about what existed before, so it's not really 25 years of
        | history here. I want to say "Watchtower" came before Jupiter,
        | but honestly it's been about a decade since I read anything
        | about it.
        | 
        | Google's DC networking is interesting because of how deeply
        | integrated it is into the entire software stack. Click on some
        | of the links and you'll see it mentions SDN (Software-Defined
        | Networking). This is so Borg instances can talk to each other
        | within the same service at high throughput and low latency.
        | 8-10 years ago this was (IIRC) 40Gbps connections. It's
        | probably 100Gbps now, but that's just a guess.
       | 
       | But the networking is also integrated into global services like
       | traffic management to handle, say, DDoS attacks.
       | 
        | Anyway, from reading this it doesn't sound like Google is
        | abandoning their custom TPU silicon (i.e. it talks about the
        | upcoming A3 Ultra and Trillium). So where does Nvidia ConnectX
        | fit in? AFAICT that's just the NIC they're plugging into
        | Jupiter. That's probably what enables (or will enable) 100Gbps
        | connections between servers. Yes, 100GbE optical NICs have
        | existed for a long time. I would assume that Nvidia produces
        | better ones in terms of price, performance, size, power usage
        | and/or heat produced.
       | 
       | Disclaimer: Xoogler. I didn't work in networking though.
        
         | virtuallynathan wrote:
          | This latest revision of Jupiter is apparently 400G, as is the
          | ConnectX-7; the A3 Ultra will have 8 of them!
        
         | neomantra wrote:
         | Nvidia got ConnectX from their Mellanox acquisition -- they
          | were experts in RDMA, particularly with InfiniBand but
          | eventually pushing Ethernet (RoCE). These NICs have hardware
          | acceleration of RDMA. Over the RDMA fabric, GPUs can
         | communicate with each other without much CPU usage (the "GPU-
         | to-GPU" mentioned in the article).
         | 
         | [I know nothing about Jupiter, and little about RDMA in
         | practice, but used ConnectX for VMA, its hardware-accelerated,
         | kernel-bypass tech.]
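
          As a rough illustration of the kernel-bypass model described
          above, here is a minimal sketch in C against the libibverbs
          user-space API (the standard verbs library that ConnectX NICs
          are driven through): it opens an RDMA device and registers a
          memory region so the NIC can DMA to and from it directly. This
          is only an assumed, illustrative setup, not any provider's
          production path; with GPUDirect RDMA the registered buffer
          would be GPU memory rather than host memory.

            /* Minimal RDMA setup sketch using libibverbs.
             * Build (assumption): cc rdma_sketch.c -libverbs */
            #include <infiniband/verbs.h>
            #include <stdio.h>
            #include <stdlib.h>

            int main(void) {
                int n = 0;
                struct ibv_device **devs = ibv_get_device_list(&n);
                if (!devs || n == 0) {
                    fprintf(stderr, "no RDMA-capable devices found\n");
                    return 1;
                }

                /* Open the first device, then release the list. */
                struct ibv_context *ctx = ibv_open_device(devs[0]);
                ibv_free_device_list(devs);
                if (!ctx) return 1;

                /* Protection domain: scopes which resources may be
                 * used together. */
                struct ibv_pd *pd = ibv_alloc_pd(ctx);
                if (!pd) return 1;

                /* Register a buffer so the NIC can read/write it
                 * without kernel involvement on the data path. */
                size_t len = 1 << 20;
                void *buf = malloc(len);
                struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                               IBV_ACCESS_LOCAL_WRITE |
                                               IBV_ACCESS_REMOTE_WRITE |
                                               IBV_ACCESS_REMOTE_READ);
                if (!mr) {
                    fprintf(stderr, "ibv_reg_mr failed\n");
                    return 1;
                }

                printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
                       len, (unsigned) mr->lkey, (unsigned) mr->rkey);

                /* A real application would now create completion
                 * queues and queue pairs, exchange the rkey and buffer
                 * address with its peer, and post RDMA READ/WRITE work
                 * requests that the NIC executes without CPU copies. */
                ibv_dereg_mr(mr);
                ibv_dealloc_pd(pd);
                ibv_close_device(ctx);
                free(buf);
                return 0;
            }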
        
         | ceph_ wrote:
         | From memory: Firehose > Watchtower > WCC > SCC > Jupiter v1
        
         | CBLT wrote:
         | I would guess the Nvidia ConnectX is part of a secondary
         | networking plane, not plugged into Jupiter. Current-gen Google
         | NICs are custom hardware with a _lot_ of Google-specific
         | functionality, such as running the borglet on the NIC to free
         | up all CPU cores for guests.
        
         | cavisne wrote:
          | The past few years there has been a weird situation where
          | Google and AWS have had worse GPUs than smaller providers
          | like CoreWeave and Lambda Labs. This is because they didn't
          | want to buy into Nvidia's proprietary InfiniBand stack for
          | GPU-GPU networking, and instead wanted to make it work on
          | top of their Ethernet-based (but still pretty proprietary)
          | stack.
          | 
          | The outcome was really bad GPU-GPU latency & bandwidth
          | between machines. My understanding is that ConnectX is
          | Nvidia's supported (and probably still very profitable) way
          | for these hyperscalers to use their proprietary networks
          | without buying InfiniBand switches and without paying the
          | latency cost of moving bytes from the GPU to the CPU.
        
           | latchkey wrote:
           | Your understanding is correct. Part of the other issue is
           | that at one point, there was a huge shortage of availability
           | of IB switches... lead times of 1+ years... another solution
           | had to be found.
           | 
           | RoCE is IB over Ethernet. All the underlying documentation
           | and settings to put this stuff together are the same. It
            | doesn't require ConnectX NICs, though. We do the same with
            | 8x Broadcom Thor 2 NICs (into a Broadcom Tomahawk 5-based
            | Dell Z9864F switch) for our own 400G cluster.
        
       | maz1b wrote:
        | Pretty crazy. Supporting 1.5 Mbps video calls for each human on
        | Earth? Did I read that right?
        | 
        | Just goes to show how drastic and extraordinary these levels of
        | scale can be.
        
         | sethammons wrote:
         | Scale means different things to different people
        
       | 486sx33 wrote:
       | The most amazing surveillance machine ever ...
        
       | belter wrote:
        | Awesome, Google... Now learn what an availability zone is and
        | stop creating them with firewalls within the same data center.
        | 
        | Oh, and make your data centers smaller. Not so big they can be
        | seen on Google Maps. Because otherwise, you will be unable to
        | move those whale-sized workloads to an alternative.
       | 
       | https://youtu.be/mDNHK-SzXEM?t=564
       | 
       | https://news.ycombinator.com/item?id=35713001
       | 
       | "Unmasking Google Cloud: How to Determine if a Region Supports
       | Physical Zone Separation" -
       | https://cagataygurturk.medium.com/unmasking-google-cloud-how...
        
         | tecleandor wrote:
          | Making a datacenter not visible on Google Maps, at least in
          | most big cities where Google zones are deployed, would mean
          | making it smaller than a car. Or even smaller than a
          | dishwasher.
          | 
          | If I check London (where europe-west2 is kinda located) on
          | Google Maps right now, I can easily discern manhole covers or
          | people. If I check Jakarta (asia-southeast2), things smaller
          | than a car get confusing, but you can definitely see them.
        
           | belter wrote:
            | Your comment does not address the essence of the point I
            | was trying to make. If you have one monstrous data center
            | instead of many relatively smaller ones, you are putting
            | too many eggs in one giant basket.
        
             | joshuamorton wrote:
             | What if you have dozens of big data centers?
        
               | jiggawatts wrote:
               | To reinforce your point:
               | 
               | The scale of cloud data centres reflects the scale of
               | their customer base, not the size of the basket for each
               | individual customer.
               | 
                | Larger data centres actually improve availability
                | through several mechanisms: with more power components,
                | such as generators, the failure of any one costs just a
                | few percent of capacity instead of a total blackout.
                | You can also
               | partition core infrastructure like routers and power
               | rails into more fault domains and update domains.
               | 
               | Some large clouds have two update domains and five fault
               | domains _on top of_ three zones that are more than 10km
               | apart. You can't beat ~30 individual partitions with your
               | data centres at a reasonable cost!
        
               | belter wrote:
                | I provided three different references. Despite the
                | massive downvotes on my comment as a troll (by Google
                | engineers, I guess... :-) ), I take comfort in the fact
                | that nobody was able to advance a reference to prove me
                | wrong.
        
               | joshuamorton wrote:
               | You haven't actually made an argument.
               | 
               | It is true that the nomenclature "AWS Availability Zone"
               | has a different meaning than "GCP Zone" when discussing
               | the physical separation between zones within the same
               | region.
               | 
                | It's unclear why this is inherently a bad thing, as long
                | as the same overall level of reliability is achieved.
        
               | belter wrote:
               | The phrase "as long as the same overall level of
               | reliability is achieved" is logically flawed when
               | discussing physically co-located vs. geographically
               | separated infrastructure.
        
               | joshuamorton wrote:
               | Justify that claim.
               | 
                | In my experience, the set of issues that would affect
                | two buildings close to each other, but not two buildings
                | a mile apart, is vanishingly small: usually _just_ last-
                | mile fiber cuts or power issues (which are rare and
                | mitigated by having multiple independent providers), as
                | well as issues like building fires (which are
                | exceedingly rare; we know of perhaps two of notable
                | impact in more than a decade across the big three cloud
                | providers).
               | 
               | Everything else is done at the zone level no matter what
               | (onsite repair work, rollouts, upgrades, control plane
               | changes, etc.) or can impact an entire region (non-last
               | mile fiber or power cuts, inclement weather, regional
               | power starvation, etc.)
               | 
               | There is a potential gain from physical zone isolation,
               | but it protects against a relatively small set of issues.
               | Is it really better to invest in that, or to invest the
               | resources in other safety improvements?
        
         | traceroute66 wrote:
         | > Oh and make your data centers smaller. Not so big they can be
         | seen in Google Maps.
         | 
          | I'll let you in on a secret. Well, it's not really a secret,
          | because everyone in the industry knows it....
         | 
         | All the cloud providers, big and small, they use the same
         | third-party colocation sites as everybody else.
         | 
          | Sure, the big boys have a few of their own sites. But that is
          | mostly in the US, and mostly because they got some tax break
          | in some state for building a datacentre in the middle of
          | nowhere. But in reality you can count the wholly-owned sites
          | on one hand, one and a half hands at a push.
         | 
         | In most countries outside the US however, all the cloud
         | providers are in the same colo sites as you and I. You will not
         | see their kit because they rent whole floors rather than racks
         | or cages. But trust me, they are there, all of them, in the
         | same building as you. AWS, Microsoft, Google all in the same
         | building.
         | 
         | So that's why (some of) the sites you see on Google Maps are so
         | big. Because they are colocation campuses used by many
         | customers, cloud and corporate.
        
           | anewplace wrote:
           | This isn't even close to true. You can just go on Google Maps
           | and visually see the literally *hundreds* of wholly-owned and
            | custom-built data centers from AWS, MS, and Google. Edge
            | locations (like Cloud CDN) are often in colos, but the main
            | regions' compute/storage are not. Most of them are even
           | labeled on Google Maps.
           | 
           | Here's a couple search terms you can just type into Google
           | Maps and see a small fraction of what I mean:
           | 
           | - "Google Data Center Berkeley County"
           | 
           | - "Microsoft Data Center Boydton"
           | 
           | - "GXO council bluffs" (two locations will appear, both are
           | GCP data centers)
           | 
           | - "Google Data Center - Henderson"
           | 
           | - "Microsoft - DB5 Datacentre" (this one is in Dublin, and is
           | huuuuuge)
           | 
           | - "Meta Datacenter Clonee"
           | 
           | - "Google Data Center (New Albany)" (just to the east of this
           | one is a massive Meta data center campus, and to the
           | immediate east of it is a Microsoft data center campus under
           | construction)
           | 
           | And that's just a small sample. There are hundreds of these
           | sites across the US. You're somewhat right that a lot of
           | international locations are colocated in places like Equinix
           | data centers, but even then it's not all of them and varies
           | by country (for example in Dublin they mostly all have their
           | own buildings, not colo). If you know where to look and what
            | the buildings look like, the custom-built and self-owned data
           | centers from the big cloud providers are easy to spot since
           | they all have their own custom design.
        
             | traceroute66 wrote:
             | > This isn't even close to true.
             | 
             | Bullshit. Yes it is 100% true.
             | 
             | Why ?
             | 
             | Because I work at many of these sites in many countries in
             | my region.
             | 
             | I walk past the offices where the Google staff sit.
             | 
             | I bump into Amazon staff in the elevators and the customer
             | break rooms.
             | 
             | I walk past the Microsoft suites with their very particular
             | security setup.
             | 
             | All in the same buildings as $myCo's cages and racks.
             | 
             | I'm not going to go into further detail for obvious
             | security reasons, but when I say "trust me", I mean it.
        
               | anewplace wrote:
               | Yea, you're not the only "insider" here. And you're 100%
               | wrong. Just because you completely misunderstand what
               | those Amazon/MS employees are doing in those buildings
               | doesn't mean that you know what you're talking about.
               | 
               | The big cloud players have the _vast_ majority of their
               | compute and storage hosted out of their own custom built
               | and self-owned data centers. The stuff you see in colos
               | is just the edge locations like Cloudfront and Cloud CDN,
               | or the new-ish offerings like AWS Local Zones (which are
               | a mix between self-owned and colo, depending on how large
               | the local zone is).
               | 
               | Most of this is publicly available by just reading sites
               | like datacenterdynamics.com regularly, btw. No insider
               | knowledge needed.
        
               | traceroute66 wrote:
               | > Yea, you're not the only "insider" here. And you're
               | 100% wrong.
               | 
               | Let's just say you don't know who I am and who I have
               | spoken to about it.
               | 
               | Those people are very senior, and work under positions
               | where by definition of their job role they absolutely
                | know what's going on, and they are under NDA, so I'm
                | not going to go into any further detail whatsoever.
               | 
               | Those people have confirmed what I have observed.
               | 
               | And no, before you try to say it .... I'm NOT talking
               | about security staff or the on-site cloud-staff I
               | mentioned before. I'm talking about pay-grades many many
               | many layers higher.
        
               | anewplace wrote:
               | Well those people lied to you then, or more likely there
               | was a misunderstanding, because you can literally just
               | look up the sites I mentioned above and see that you're
               | entirely incorrect.
               | 
               | You don't need to be under NDA to see the hundreds of
               | _billions_ of dollars worth of custom built and self-
               | owned data centers that the big players have.
               | 
               | Hell, you can literally just look at their public
               | websites:
               | https://www.google.com/about/datacenters/locations/
               | 
               | I am one of those "pay grades many layers higher", and I
               | can personally confirm that each of the locations above
               | is wholly owned and used by Google, and only Google,
               | which already invalidates your claim that "you can count
               | the wholly-owned sites on one hand". Again, this isn't
               | secret info, so I have no issue sharing it.
        
               | traceroute66 wrote:
               | > Well those people lied to you then, or more likely
               | there was a misunderstanding
               | 
                | You are crossing into the personal insult side, chum....
               | 
               | I suggest you stop it right there.
               | 
               | I don't know you. You don't know me.
               | 
               | But what I do know is I was not lied to and there was no
               | misunderstanding. Because I'm not relying on what one
               | person or one company told me, my facts have been
               | diligently and discretely cross-checked. I've worked in
               | the industry long enough, I wasn't born yesterday...
               | 
               | To add to the list of people I was NOT talking to, you
               | can add people like greasy sales reps who might have
               | reason to lie. I'm not stupid.
               | 
                | By implication you are also insinuating that multiple
                | very senior people would collude to lie to me when they
                | don't even work for the same company. I think even you
                | would agree that's a nuts allegation to make?
               | 
               | You have no clue about the who, what, where and when of
               | my discussions.
               | 
               | You are trying to make me divulge details publicly of who
               | I spoke to etc. which I'm not going to go into.
               | Especially with a random Joe I don't know and certainly
               | not on a public space such as this. End of story.
        
               | anewplace wrote:
               | I'm not trying to make you divulge anything. I don't
               | particularly care who you talk to, or who you are, nor do
               | I care if you take it as a "personal insult" that you
               | might be wrong.
               | 
               | You are right that it would be nuts that multiple senior
               | people would collude to lie to you, which is why it's
               | almost certainly more likely that you are just
               | misunderstanding the information that was provided to
               | you. It's possible to prove that you are incorrect based
               | on publicly available data from multiple different
               | sources. You can keep being stubborn if you want, but
               | that won't make any of your statements correct.
               | 
               | You didn't ask for my advice, but I'll give it anyway:
               | try to be more open to the possibility that you're wrong,
               | especially when evidence that you're wrong is right in
               | front of you. End of story.
        
               | pknomad wrote:
               | Ignore him.
               | 
                | He's got a tinfoil hat on and won't be persuaded.
               | 
               | > Because I'm not relying on what one person or one
               | company told me, my facts have been diligently and
               | discretely cross-checked.
               | 
               | "Discretely cross-checked" already tells me he chooses to
               | live in his own reality.
        
               | dekhn wrote:
               | I assume they are referring to PoPs, or other locations
               | where large providers house frontends and other smaller
               | resources.
        
               | joshuamorton wrote:
               | It is true that every cloud provider uses some edge/colo
               | infra, but it is also not true that most (or even really
               | any relevant) processing happens in those colo/edge
               | locations.
               | 
               | Google lists their dc locations publicly:
               | https://www.google.com/about/datacenters/locations/
               | 
               | Aws doesn't list the campuses as publicly, but
               | https://aws.amazon.com/about-aws/global-
               | infrastructure/regio... shows the AZ vs edge deployments
               | and any situation with multiple AZs is going to have
                | _buildings_, not floors, operated by Amazon.
               | 
               | And limiting to just outside the US, both aws and Google
               | have more than ten wholly owned campuses each, and then
               | on top of that, there is edge/colo space.
        
               | toast0 wrote:
               | It can be true that all the big clouds/cdns/websites are
               | in all the big colos _and_ that big tech also has many
               | owned and operated sites elsewhere.
               | 
                | As one of these big companies, you've got to be in the
               | big colos because that's where you interconnect and peer.
               | You don't want to have a full datacenter installation at
               | one of these places if you can avoid it, because costs
               | are high; but building your own has a long timetable, so
               | it makes sense to put things into colos from time to time
               | and of course, things get entrenched.
               | 
               | I've seen datacenter lists when I worked at Yahoo and
               | Facebook, and it was a mix of small installations at
               | PoPs, larger installations at commercial colo facilities,
               | and owned and operated data centers. Usually new large
               | installations were owned and operated, but it took a long
               | time to move out of commercial colos too. And then
                | there's also whole-building leases, from companies that
                | specialize in that. Outside the US, there was more
                | likelihood of being in commercial colo, I think because
                | of logistics, but at large system counts, the dollar
                | efficiency of running it yourself becomes more appealing
                | (assuming land, electricity, and fiber are available).
        
       | dangoodmanUT wrote:
        | Does GCP have the worst networking for GPU training, though?
        
         | dweekly wrote:
          | For TPU pods they use a 3D torus topology with multi-terabit
         | cross connects. For GPU, A3 Ultra instances offer "non-blocking
         | 3.2 Tbps per server of GPU-to-GPU traffic over RoCE".
         | 
         | Is that the worst for training? Namely: do superior solutions
         | exist?
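
          For readers unfamiliar with the topology: in a 3D torus each
          node links to its +/-1 neighbor in every dimension, with
          wraparound at the edges, so every node has exactly six links
          and there are no boundary cases. A small sketch below uses an
          arbitrary 4x4x4 shape (real TPU pod shapes differ) and just
          prints a node's neighbors:

            #include <stdio.h>

            /* Arbitrary example dimensions; not an actual pod shape. */
            #define DX 4
            #define DY 4
            #define DZ 4

            /* Print the six wraparound neighbors of node (x, y, z). */
            static void neighbors(int x, int y, int z) {
                int d[6][3] = {{1,0,0},{-1,0,0},{0,1,0},
                               {0,-1,0},{0,0,1},{0,0,-1}};
                for (int i = 0; i < 6; i++) {
                    int nx = (x + d[i][0] + DX) % DX;
                    int ny = (y + d[i][1] + DY) % DY;
                    int nz = (z + d[i][2] + DZ) % DZ;
                    printf("(%d,%d,%d) -> (%d,%d,%d)\n",
                           x, y, z, nx, ny, nz);
                }
            }

            int main(void) {
                neighbors(0, 0, 0); /* a "corner" still has 6 links */
                return 0;
            }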
        
       | ksec wrote:
        | They managed to double from 6 petabits per second in 2022 to 13
        | Pbps in 2023. I assume with ConnectX-8 this could be 26 Pbps in
        | 2025/26. The ConnectX-8 is PCIe 6.0, so I assume we could get a
        | 1.6 Tbps ConnectX-9 with PCIe 7.0, which is not far away.
        | 
        | Can't wait to see the FreeBSD Netflix version of that post.
        | 
        | This also goes back to how increasing throughput is relatively
        | easy and has a very strong roadmap, while increasing storage is
        | difficult. I notice YouTube has been serving higher-bitrate
        | video in recent years with H.264, instead of storing yet
        | another copy of video files in VP9 or AV1 unless they are 2K+.
        
       | reaperducer wrote:
       | _Speed, scale and reliability_
       | 
       | Choose any two.
        
         | teractiveodular wrote:
         | Which of those is Google's network missing?
        
       ___________________________________________________________________
       (page generated 2024-11-03 23:00 UTC)