[HN Gopher] Speed, scale and reliability: 25 years of Google dat...
___________________________________________________________________
Speed, scale and reliability: 25 years of Google datacenter
networking evolution
Author : sandwichsphinx
Score : 236 points
Date : 2024-11-03 04:29 UTC (18 hours ago)
(HTM) web link (cloud.google.com)
(TXT) w3m dump (cloud.google.com)
| alex_young wrote:
| Like most discussions of the last 25 years, this one starts 9
| years ago. Good times.
| eru wrote:
| The Further Resources section goes a bit further back.
| jerzmacow wrote:
| Wow, and it doesn't open with a picture of their Lego server?
| Wasn't that their first one, 25 years ago?
| teractiveodular wrote:
| It's a marketing piece, they don't particularly want to
| emphasize the hacky early days for an audience of Serious
| Enterprise Customers.
| DeathArrow wrote:
| It seems all cutting edge datacenters like x.ai Colossus are
| using Nvidia networking. Now Google is upgrading to Nvidia
| networking, too.
|
| Since Nvidia owns most of the GPGPU market and has top-notch
| networking and interconnect, I wonder whether they plan to own
| all datacenter hardware in the future. Maybe they plan to
| also release CPUs, motherboards, storage and whatever else is
| needed.
| Kab1r wrote:
| Grace Hopper already includes Arm-based CPUs (and reference
| motherboards)
| timenova wrote:
| That was their plan with trying to buy ARM...
| danpalmer wrote:
| I read this slightly differently, that specific machine types
| with Nvidia GPU hardware also have Nvidia networking for tying
| together those GPUs.
|
| Google has its own TPUs and doesn't really use GPUs except to
| sell them to end customers on cloud, I think. So using Nvidia
| networking for Nvidia GPUs across many machines on cloud is
| really just a reflection of what external customers want to
| buy.
|
| Disclaimer: I work at Google but have no non-public info about
| this.
| dmacedo wrote:
| Having just worked with some of the Thread folks at M&S,
| thought I'd reach out and say hello. Seems like it was an
| awesome team! (=
| danpalmer wrote:
| You're lucky to be working with them, an amazing team.
| adrian_b wrote:
| Nvidia networking is what used to be called Mellanox
| networking, which was already dominant in datacenters.
| immibis wrote:
| Only within supercomputers (including the smaller GPU ones
| used to train AI). Normal data centers use Cisco or Juniper
| or similarly well-known Ethernet equipment, and they still
| do. The Mellanox/Nvidia Infiniband networks are specifically
| used for supercomputer-like clusters.
| wbl wrote:
| Mellanox Ethernet NICs got used in a bunch of places due to
| their better programmability.
| dafugg wrote:
| You seem to have a narrow definition of "normal" for
| datacenters. Meta were using OCP Mellanox NICs for common
| hardware platforms a decade ago and still are.
| HDThoreaun wrote:
| I have to wonder if Nvidia has reached a point where it
| hesitates to develop new products because they would hurt its
| margins. Sure, it could probably release a profitable
| networking product, but if it did, its net margins would
| decrease even as profit increased. This may actually hurt its
| market cap, as investors absolutely love high margins.
| eru wrote:
| They can always release capital back to investors, and then
| those investors can put the money into different companies
| that eg produce networking equipment.
| thrw42A8N wrote:
| Why would they release money if they can invest it and
| return much more?
| mikeyouse wrote:
| Yeah there's a bit of industry worry about that very
| eventuality -- hence the Ultra Ethernet Consortium trying to
| work on open-source alternatives to the Mellanox/Nvidia
| lock-in.
|
| https://ultraethernet.org/
| ravetcofx wrote:
| Interesting Nvidia is on the steering committee
| jonas21 wrote:
| I believe this is what they plan on doing, see:
|
| https://www.youtube.com/live/Y2F8yisiS6E?si=GbyzzIG8w-mtS7s-...
| cletus wrote:
| This mentions Jupiter generations, which I think go back about
| 10-15 years at this point. It doesn't really talk about what
| existed before, so it's not really 25 years of history here. I
| want to say "Watchtower" came before Jupiter, but honestly it's
| been about a decade since I read anything about it.
|
| Google's DC networking is interesting because of how deeply
| integrated it is into the entire software stack. Click on some of
| the links and you'll see it mentions SDN (Software Defined
| Network). This is so Borg instances can talk to each other within
| the same service at high throughput and low latency. 8-10 years
| ago this was (IIRC) 40Gbps connections. It's probably 100Gbps now
| but that's just a guess.
|
| But the networking is also integrated into global services like
| traffic management to handle, say, DDoS attacks.
|
| Anyway, from reading this it doesn't sound like Google is
| abandoning their custom TPU silicon (i.e., it talks about the
| upcoming A3 Ultra and Trillium). So where does NVidia ConnectX
| fit in? AFAICT that's just the NIC they're plugging into Jupiter.
| That's probably what enables (or will enable) 100Gbps connections
| between servers. Yes, 100GbE optical NICs have existed for a long
| time. I would assume that NVidia produce better ones in terms of
| price, performance, size, power usage and/or heat produced.
|
| Disclaimer: Xoogler. I didn't work in networking though.
| virtuallynathan wrote:
| This latest revision of Jupiter is apparently 400G, as is the
| ConnectX-7; the A3 Ultra will have 8 of them!
| neomantra wrote:
| Nvidia got ConnectX from their Mellanox acquisition -- they
| were experts in RDMA, particularly with Infiniband but
| eventually pushing Ethernet (RoCE). These NICs have hardware
| acceleration of RDMA. Over the RDMA fabric, GPUs can
| communicate with each other without much CPU usage (the "GPU-
| to-GPU" mentioned in the article).
|
| [I know nothing about Jupiter, and little about RDMA in
| practice, but used ConnectX for VMA, its hardware-accelerated,
| kernel-bypass tech.]
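|
| To make the "GPU-to-GPU without much CPU usage" idea concrete,
| here is a minimal sketch (mine, not from the article) of a
| cross-GPU collective using PyTorch's NCCL backend, which rides
| on Infiniband or RoCE when the NICs support it. It assumes a
| torchrun launch, which provides RANK, WORLD_SIZE, MASTER_ADDR
| and LOCAL_RANK.
|
|     # Sketch: GPU-to-GPU all-reduce over an RDMA-capable fabric via NCCL.
|     # Assumes PyTorch with CUDA and one process per GPU, launched with torchrun.
|     import os
|     import torch
|     import torch.distributed as dist
|
|     def main():
|         dist.init_process_group(backend="nccl")  # NCCL picks IB/RoCE if present
|         local_rank = int(os.environ["LOCAL_RANK"])
|         torch.cuda.set_device(local_rank)
|
|         # The tensor lives in GPU memory; with GPUDirect RDMA the NIC reads
|         # it directly, so the payload never stages through the CPU.
|         x = torch.ones(1 << 20, device="cuda") * dist.get_rank()
|         dist.all_reduce(x, op=dist.ReduceOp.SUM)  # summed across all GPUs
|         print(f"rank {dist.get_rank()}: first element = {x[0].item()}")
|
|         dist.destroy_process_group()
|
|     if __name__ == "__main__":
|         main()
|
| Whether the bytes actually travel over Infiniband, RoCE, or
| plain TCP is decided by NCCL at runtime based on the hardware
| it finds.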
| ceph_ wrote:
| From memory: Firehose > Watchtower > WCC > SCC > Jupiter v1
| CBLT wrote:
| I would guess the Nvidia ConnectX is part of a secondary
| networking plane, not plugged into Jupiter. Current-gen Google
| NICs are custom hardware with a _lot_ of Google-specific
| functionality, such as running the borglet on the NIC to free
| up all CPU cores for guests.
| cavisne wrote:
| The past few years there has been a weird situation where
| Google and AWS have had worse GPUs than smaller providers like
| Coreweave + Lambda Labs. This is because they didn't want to
| buy into Nvidia's proprietary Infiniband stack for GPU-GPU
| networking, and instead wanted to make it work on top of their
| own (but still pretty proprietary) Ethernet stack.
|
| The outcome was really bad GPU-GPU latency & bandwidth between
| machines. My understanding is ConnectX is Nvidia's supported
| (and probably still very profitable) way for these hyperscalers
| to use their proprietary networks without buying Infiniband
| switches and without paying the latency cost of moving bytes
| from the GPU to the CPU.
| latchkey wrote:
| Your understanding is correct. Part of the other issue is
| that at one point, there was a huge shortage of availability
| of IB switches... lead times of 1+ years... another solution
| had to be found.
|
| RoCE is IB over Ethernet. All the underlying documentation
| and settings to put this stuff together are the same. It
| doesn't require ConnectX NICs, though. We do the same with 8x
| Broadcom Thor 2 NICs (into a Broadcom Tomahawk 5-based Dell
| Z9864F switch) for our own 400G cluster.
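|
| For a sense of what "the settings are the same" means in
| practice, here is a sketch (mine, not this specific setup) of
| the kind of NCCL knobs a RoCE fabric needs. The variable names
| are real NCCL options; the values (device names, GID index,
| interface) are placeholders that depend on the actual site.
|
|     # Sketch: pointing NCCL at a RoCE fabric instead of Infiniband.
|     import os
|
|     os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"  # RDMA devices (placeholder names)
|     os.environ["NCCL_IB_GID_INDEX"] = "3"        # RoCEv2 GID entry; site-specific
|     os.environ["NCCL_SOCKET_IFNAME"] = "eth0"    # interface for NCCL bootstrap traffic
|     os.environ["NCCL_DEBUG"] = "INFO"            # logs which transport was chosen
|
|     # ...then initialize the NCCL process group as usual.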
| maz1b wrote:
| Pretty crazy. Supporting 1.5 Mbps video calls for each human on
| earth? Did I read that right?
|
| Just goes to show how drastic and extraordinary levels of scale
| can be.
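|
| A back-of-the-envelope check, using the ~13 Pbps fabric
| bandwidth figure mentioned elsewhere in the thread:
|
|     # Rough check of the "1.5 Mbps video call per human" framing.
|     people = 8e9                 # approximate world population
|     per_call_bps = 1.5e6         # 1.5 Mbps per video call
|     total_pbps = people * per_call_bps / 1e15
|     print(f"{total_pbps:.0f} Pbps needed vs ~13 Pbps of fabric")  # 12 Pbps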
| sethammons wrote:
| Scale means different things to different people
| 486sx33 wrote:
| The most amazing surveillance machine ever ...
| belter wrote:
| Awesome Google... Now learn what an availability zone is and stop
| creating them as merely firewalled sections of the same data
| center.
|
| Oh, and make your data centers smaller. Not so big that they
| can be seen on Google Maps. Otherwise you will be unable to
| move those whale-sized workloads to an alternative.
|
| https://youtu.be/mDNHK-SzXEM?t=564
|
| https://news.ycombinator.com/item?id=35713001
|
| "Unmasking Google Cloud: How to Determine if a Region Supports
| Physical Zone Separation" -
| https://cagataygurturk.medium.com/unmasking-google-cloud-how...
| tecleandor wrote:
| Making a datacenter not visible on Google Maps, at least in
| most big cities where Google zones are deployed, would mean
| making it smaller than a car. Or even smaller than a
| dishwasher.
|
| If I check London (where europe-west2 is kinda located) on
| Google Maps right now, I can easily discern manhole covers or
| people. If I check Jakarta (asia-southeast2), things smaller
| than a car are harder to make out, but you can definitely see
| them.
| belter wrote:
| Your comment does not address the essence of the point I was
| trying to make. If you have one monstrous data center instead
| of many relatively smaller ones, you are putting too many eggs
| in one giant basket.
| joshuamorton wrote:
| What if you have dozens of big data centers?
| jiggawatts wrote:
| To reinforce your point:
|
| The scale of cloud data centres reflects the scale of
| their customer base, not the size of the basket for each
| individual customer.
|
| Larger data centres actually improve availability through
| several mechanisms: with more power components such as
| generators, the failure of any one of them costs just a few
| percent of capacity instead of a total blackout. You can also
| partition core infrastructure like routers and power
| rails into more fault domains and update domains.
|
| Some large clouds have two update domains and five fault
| domains _on top of_ three zones that are more than 10km
| apart. You can't beat ~30 individual partitions with your own
| data centres at a reasonable cost!
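|
| (For the arithmetic behind that figure: every combination of
| zone, fault domain and update domain is its own blast radius.)
|
|     zones, fault_domains, update_domains = 3, 5, 2
|     print(zones * fault_domains * update_domains)  # 30 independent partitions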
| belter wrote:
| I provided three different references. Despite the massive
| downvotes on my comment, I guess by Google engineers treating
| it as a troll... :-) I take comfort in the fact that nobody was
| able to produce a reference to prove me wrong.
| joshuamorton wrote:
| You haven't actually made an argument.
|
| It is true that the nomenclature "AWS Availability Zone"
| has a different meaning than "GCP Zone" when discussing
| the physical separation between zones within the same
| region.
|
| It's unclear why this is inherently a bad thing, as long
| as the same overall level of reliability is achieved.
| belter wrote:
| The phrase "as long as the same overall level of
| reliability is achieved" is logically flawed when
| discussing physically co-located vs. geographically
| separated infrastructure.
| joshuamorton wrote:
| Justify that claim.
|
| In my experience, the set of issues that would affect two
| buildings close to each other but not two buildings a mile
| apart is vanishingly small: usually _just_ last-mile fiber
| cuts or power issues (which are rare and mitigated by having
| multiple independent providers), as well as issues like
| building fires (which are exceedingly rare; we know of perhaps
| two of notable impact in more than a decade across the big
| three cloud providers).
|
| Everything else is done at the zone level no matter what
| (onsite repair work, rollouts, upgrades, control plane
| changes, etc.) or can impact an entire region (non-last
| mile fiber or power cuts, inclement weather, regional
| power starvation, etc.)
|
| There is a potential gain from physical zone isolation,
| but it protects against a relatively small set of issues.
| Is it really better to invest in that, or to invest the
| resources in other safety improvements?
| traceroute66 wrote:
| > Oh and make your data centers smaller. Not so big they can be
| seen in Google Maps.
|
| I'll let you into a secret. Well, it's not really a secret,
| because everyone in the industry knows it...
|
| All the cloud providers, big and small, they use the same
| third-party colocation sites as everybody else.
|
| Sure, the big-boys have a few of their own sites. But that is
| mostly in the US, and mostly because they got some tax break in
| some state for building a datacentre in the middle of nowhere.
| But in reality you can count the wholly-owned sites on one
| hand, one and a half hands at a push.
|
| In most countries outside the US, however, all the cloud
| providers are in the same colo sites as you and I. You will not
| see their kit because they rent whole floors rather than racks
| or cages. But trust me, they are there, all of them, in the
| same building as you. AWS, Microsoft, Google all in the same
| building.
|
| So that's why (some of) the sites you see on Google Maps are so
| big. Because they are colocation campuses used by many
| customers, cloud and corporate.
| anewplace wrote:
| This isn't even close to true. You can just go on Google Maps
| and visually see the literally *hundreds* of wholly-owned and
| custom-built data centers from AWS, MS, and Google. Edge
| locations (like Cloud CDN) are often in colos, but the main
| regions' compute/storage is not. Most of them are even
| labeled on Google Maps.
|
| Here's a couple search terms you can just type into Google
| Maps and see a small fraction of what I mean:
|
| - "Google Data Center Berkeley County"
|
| - "Microsoft Data Center Boydton"
|
| - "GXO council bluffs" (two locations will appear, both are
| GCP data centers)
|
| - "Google Data Center - Henderson"
|
| - "Microsoft - DB5 Datacentre" (this one is in Dublin, and is
| huuuuuge)
|
| - "Meta Datacenter Clonee"
|
| - "Google Data Center (New Albany)" (just to the east of this
| one is a massive Meta data center campus, and to the
| immediate east of it is a Microsoft data center campus under
| construction)
|
| And that's just a small sample. There are hundreds of these
| sites across the US. You're somewhat right that a lot of
| international locations are colocated in places like Equinix
| data centers, but even then it's not all of them and varies
| by country (for example in Dublin they mostly all have their
| own buildings, not colo). If you know where to look and what
| the buildings look like, the custom-built and self-owned data
| centers from the big cloud providers are easy to spot since
| they all have their own custom design.
| traceroute66 wrote:
| > This isn't even close to true.
|
| Bullshit. Yes it is 100% true.
|
| Why ?
|
| Because I work at many of these sites in many countries in
| my region.
|
| I walk past the offices where the Google staff sit.
|
| I bump into Amazon staff in the elevators and the customer
| break rooms.
|
| I walk past the Microsoft suites with their very particular
| security setup.
|
| All in the same buildings as $myCo's cages and racks.
|
| I'm not going to go into further detail for obvious
| security reasons, but when I say "trust me", I mean it.
| anewplace wrote:
| Yea, you're not the only "insider" here. And you're 100%
| wrong. Just because you completely misunderstand what
| those Amazon/MS employees are doing in those buildings
| doesn't mean that you know what you're talking about.
|
| The big cloud players have the _vast_ majority of their
| compute and storage hosted out of their own custom built
| and self-owned data centers. The stuff you see in colos
| is just the edge locations like Cloudfront and Cloud CDN,
| or the new-ish offerings like AWS Local Zones (which are
| a mix between self-owned and colo, depending on how large
| the local zone is).
|
| Most of this is publicly available by just reading sites
| like datacenterdynamics.com regularly, btw. No insider
| knowledge needed.
| traceroute66 wrote:
| > Yea, you're not the only "insider" here. And you're
| 100% wrong.
|
| Let's just say you don't know who I am and who I have
| spoken to about it.
|
| Those people are very senior and hold positions where, by
| definition of their job role, they absolutely know what's
| going on. They are under NDA, so I'm not going to go into any
| further detail whatsoever.
|
| Those people have confirmed what I have observed.
|
| And no, before you try to say it .... I'm NOT talking
| about security staff or the on-site cloud-staff I
| mentioned before. I'm talking about pay-grades many many
| many layers higher.
| anewplace wrote:
| Well those people lied to you then, or more likely there
| was a misunderstanding, because you can literally just
| look up the sites I mentioned above and see that you're
| entirely incorrect.
|
| You don't need to be under NDA to see the hundreds of
| _billions_ of dollars' worth of custom-built and self-owned
| data centers that the big players have.
|
| Hell, you can literally just look at their public
| websites:
| https://www.google.com/about/datacenters/locations/
|
| I am one of those "pay grades many layers higher", and I
| can personally confirm that each of the locations above
| is wholly owned and used by Google, and only Google,
| which already invalidates your claim that "you can count
| the wholly-owned sites on one hand". Again, this isn't
| secret info, so I have no issue sharing it.
| traceroute66 wrote:
| > Well those people lied to you then, or more likely
| there was a misunderstanding
|
| You are crossing into personal-insult territory, chum ...
|
| I suggest you stop it right there.
|
| I don't know you. You don't know me.
|
| But what I do know is I was not lied to and there was no
| misunderstanding. Because I'm not relying on what one
| person or one company told me, my facts have been
| diligently and discretely cross-checked. I've worked in
| the industry long enough, I wasn't born yesterday...
|
| To add to the list of people I was NOT talking to, you
| can add people like greasy sales reps who might have
| reason to lie. I'm not stupid.
|
| By implication you are also insinuating multiple very
| senior people would collude to lie to me when they don't
| even work for the same company. I think even you would
| agree that's a nuts allegation to make?
|
| You have no clue about the who, what, where and when of
| my discussions.
|
| You are trying to make me divulge details publicly of who
| I spoke to etc. which I'm not going to go into.
| Especially with a random Joe I don't know and certainly
| not on a public space such as this. End of story.
| anewplace wrote:
| I'm not trying to make you divulge anything. I don't
| particularly care who you talk to, or who you are, nor do
| I care if you take it as a "personal insult" that you
| might be wrong.
|
| You are right that it would be nuts that multiple senior
| people would collude to lie to you, which is why it's
| almost certainly more likely that you are just
| misunderstanding the information that was provided to
| you. It's possible to prove that you are incorrect based
| on publicly available data from multiple different
| sources. You can keep being stubborn if you want, but
| that won't make any of your statements correct.
|
| You didn't ask for my advice, but I'll give it anyway:
| try to be more open to the possibility that you're wrong,
| especially when evidence that you're wrong is right in
| front of you. End of story.
| pknomad wrote:
| Ignore him.
|
| He's got a tinfoil hat on and won't be persuaded.
|
| > Because I'm not relying on what one person or one
| company told me, my facts have been diligently and
| discretely cross-checked.
|
| "Discretely cross-checked" already tells me he chooses to
| live in his own reality.
| dekhn wrote:
| I assume they are referring to PoPs, or other locations
| where large providers house frontends and other smaller
| resources.
| joshuamorton wrote:
| It is true that every cloud provider uses some edge/colo
| infra, but it is also not true that most (or even really
| any relevant) processing happens in those colo/edge
| locations.
|
| Google lists their dc locations publicly:
| https://www.google.com/about/datacenters/locations/
|
| AWS doesn't list its campuses as publicly, but
| https://aws.amazon.com/about-aws/global-infrastructure/regio...
| shows the AZ vs. edge deployments, and any situation with
| multiple AZs is going to have _buildings_, not floors,
| operated by Amazon.
|
| And limiting to just outside the US, both AWS and Google
| have more than ten wholly-owned campuses each, and then
| on top of that, there is edge/colo space.
| toast0 wrote:
| It can be true that all the big clouds/cdns/websites are
| in all the big colos _and_ that big tech also has many
| owned and operated sites elsewhere.
|
| As one of these big companies, you've got to be in the
| big colos because that's where you interconnect and peer.
| You don't want to have a full datacenter installation at
| one of these places if you can avoid it, because costs
| are high; but building your own has a long timetable, so
| it makes sense to put things into colos from time to time
| and of course, things get entrenched.
|
| I've seen datacenter lists when I worked at Yahoo and
| Facebook, and it was a mix of small installations at
| PoPs, larger installations at commercial colo facilities,
| and owned and operated data centers. Usually new large
| installations were owned and operated, but it took a long
| time to move out of commercial colos too. And then
| there are also whole-building leases from companies that
| specialize in that. Outside the US, there was a greater
| likelihood of being in commercial colo, I think because of
| logistics, but at large system counts, the dollar
| efficiency of running it yourself becomes more appealing
| (assuming land, electricity, and fiber are available).
| dangoodmanUT wrote:
| Does gcp have the worst networking for gpu training though?
| dweekly wrote:
| For TPU pods they use a 3D torus topology with multi-terabit
| cross connects. For GPU, A3 Ultra instances offer "non-blocking
| 3.2 Tbps per server of GPU-to-GPU traffic over RoCE".
|
| Is that the worst for training? Namely: do superior solutions
| exist?
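|
| As a sanity check, the per-server figure lines up with the 8x
| ConnectX-7 layout mentioned upthread:
|
|     nics_per_server = 8
|     per_nic_gbps = 400                     # ConnectX-7 line rate
|     print(nics_per_server * per_nic_gbps)  # 3200 Gbps = 3.2 Tbps per server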
| ksec wrote:
| They managed to double from 6 Pbps in 2022 to 13 Pbps in 2023.
| I assume with ConnectX-8 this could be 26 Pbps in 2025/26. The
| ConnectX-8 is PCIe 6.0, so I assume we could get a 1.6 Tbps
| ConnectX-9 with PCIe 7.0, which is not far away.
|
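| Rough lane math behind that speculation, using approximate
| usable per-lane figures and ignoring further protocol
| overhead, so treat the numbers as ballpark:
|
|     # Usable PCIe bandwidth of an x16 slot vs the NIC line rate it must feed.
|     gbytes_per_lane = {"PCIe 5.0": 3.94, "PCIe 6.0": 7.56, "PCIe 7.0": 15.13}  # GB/s, approx
|     for gen, per_lane in gbytes_per_lane.items():
|         x16_tbps = per_lane * 16 * 8 / 1000  # GB/s -> Tbps for an x16 slot
|         print(f"{gen} x16 ~ {x16_tbps:.2f} Tbps")
|     # ~0.5 Tbps (enough for 400G), ~1.0 Tbps (800G), ~1.9 Tbps (1.6T)
|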
| Can't wait to see the FreeBSD Netflix version of that post.
|
| This also goes back to how increasing throughput is relatively
| easy and has a very strong roadmap, while increasing storage is
| difficult. I notice YouTube has been serving higher-bitrate
| video in recent years with H.264, instead of storing yet
| another copy of the video in VP9 or AV1 unless it is 2K+.
| reaperducer wrote:
| _Speed, scale and reliability_
|
| Choose any two.
| teractiveodular wrote:
| Which of those is Google's network missing?
___________________________________________________________________
(page generated 2024-11-03 23:00 UTC)