[HN Gopher] Behind the scenes: Redpanda Cloud's response to the ...
___________________________________________________________________
Behind the scenes: Redpanda Cloud's response to the GCP outage
Author : eatonphil
Score : 74 points
Date : 2025-06-21 14:57 UTC (8 hours ago)
(HTM) web link (www.redpanda.com)
(TXT) w3m dump (www.redpanda.com)
| RadiozRadioz wrote:
| Hmm. Here's what I read from this article: RedPanda didn't happen
| to use any of the stuff in GCP that went down, so they were
| unaffected. They use a 3rd party for alerting and dashboarding,
| and that 3rd party went down, but RedPanda still had their own
| monitoring.
|
| When I read "major outage for a large part of the internet was
| just another normal day for Redpanda Cloud customers", I expected
| a brave tale of RedPanda SREs valiantly fixing things, or some
| cool automatic failover tech. What I got instead was: Google told
| RedPanda there was an issue, RedPanda had a look and their
| service was unaffected, nothing needed failing over, then someone
| at RedPanda wrote an article bragging about their triple-nine
| uptime & fault tolerance.
|
| I get it, an SRE is doing well if you don't notice them, but the
| only real preventative measure I saw here that directly helped
| with this issue is that they over-provision disk space. Which
| I'd be alarmed if they didn't do.
| literallyroy wrote:
| Yeah I thought they were going to show something cool like
| multi-tenant architecture. Odd to write this article when it
| was clear they expected to be impacted, given they were reaching
| out to customers.
| dangoodmanUT wrote:
| I think you're missing the point. What I took away was that:
| "Because we design for zero dependencies for full operation,
| we didn't go down". Their extra features like tiered storage
| and monitoring going down didn't affect normal operations,
| whereas it seems it did for similar solutions with similar
| features.
| echelon wrote:
| > triple-nine uptime & fault tolerance.
|
| Haha, we used to joke that's how many nines our customer-facing
| Ruby on Rails services had compared against our resilient five
| nines payments systems. Our heavy infra handled billions in
| daily payment volume and couldn't go down.
|
| With the Ruby teams, we often playfully quipped, "which nines
| are those?" humorously implying that the leading digit wasn't
| itself a nine.
| sokoloff wrote:
| AKA: "We're closing in on our third 8 of uptime..."
| gopher_space wrote:
| > I expected a brave tale of RedPanda SREs valiantly fixing
| things, or some cool automatic failover tech.
|
| It's a tale of how they set things up so they wouldn't need to
| valiantly fix things, and I think the subtext is probably that
| Redpanda doesn't pass responsibility on to a third party.
|
| There are plenty of domains and, more importantly, people who
| need uptime guarantees to mean "fix estimate from a real human
| working on the problem" and not eventual store credit. Payroll
| is a classic example.
| RadiozRadioz wrote:
| Nothing about the way they architected their system even
| mattered in this incident. Their service just wasn't using
| any of the infrastructure that failed - there was no event
| here that actually put their system design to the test. There
| just isn't a story here.
|
| It's like if the power went out in the building next door,
| and you wrote a blog post about how amazing the reliability
| of your office computers is compared to your neighbor's. If
| your power had gone out too but you had provisioned a bunch
| of UPSs and been fine, then there's something to talk about.
|
| To extend the analogy, if the neighborhood had a reputation
| for brown-outs and you deliberately chose not to build an
| office there, then maybe you have something. But here,
| RedPanda's GCP offering is inside GCP, this failure in GCP
| has never happened before, and they just got lucky.
| bdavbdav wrote:
| "We got lucky as the way we designed it happened not to use the
| part of the service that was degraded"
| smoyer wrote:
| And we're oblivious enough about that luck that we're patting
| ourselves on the back in public.
| belter wrote:
| And we are linking our blog to the AWS doc on cell
| architectures, while talking about multi-AZ clusters on GCP
| AZs that are nothing like that...
| rybosome wrote:
| Must be hell inside GCP right now. That was a big outage, and
| they were tired of big outages years ago. It was already
| extremely difficult to move quickly and get things done due to
| the reliability red tape, and I have to imagine this will make it
| even harder.
| siscia wrote:
| In fairness, their design does not seem to be regional, with
| problems in one region bringing down another region that is
| apparently not as unrelated as it should be.
|
| With this kind of architecture, this sort of problem is just
| bound to happen.
|
| During my time at AWS, region independence was a must. And some
| services were able to operate, at least for a while, without
| degrading even when some core dependencies were unavailable.
| Think losing S3.
|
| And after that, the service would keep operating, but with a
| degraded experience.
|
| I am stunned that this level of isolation is not common in GCP.
| valenterry wrote:
| How does AWS do that though? Do they re-implement all the code
| in every region? Because even the slightest re-use of code
| could trigger a synchronous (possibly delayed) downtime of
| all regions.
| crop_rotation wrote:
| Reusing code doesn't trigger region dependencies.
|
| > Do the re-implement all the code in every region?
|
| Everyone does.
|
| The difference is AWS very strongly ensures that regions
| are independent failure domains. The GCP architecture is
| global, with all the pros and cons that implies, e.g. GCP has
| a truly global load balancer while AWS cannot, since
| everything is at its core regional.
| nijave wrote:
| They definitely roll out code (at least for some services)
| one region at a time. That doesn't prevent old bugs/issues
| from coming up but it definitely helps prevent new ones
| from becoming global outages.
| cyberax wrote:
| Regions (and even availability zones) in AWS are
| independent. The regions all have overlapping IPv4
| addresses, so direct cross-region connectivity is
| impossible.
|
| So it's actually really hard to accidentally make cross-
| region calls, if you're working inside the AWS
| infrastructure. The call has to happen over the public
| Internet, and you need a special approval for that.
|
| Deployments also happen gradually, typically only a few
| regions at a time. There's an internal tool that allows
| things to be gradually rolled out and automatically rolled
| back if monitoring detects that something is off.
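|
| A rough sketch of what such wave-based rollout tooling might look
| like (purely illustrative; the wave list, bake time, and helper
| functions are hypothetical placeholders, not AWS's internal tool):
|
|     import time
|
|     WAVES = [["us-east-1"], ["us-west-2", "eu-west-1"],
|              ["ap-southeast-1", "sa-east-1"]]
|     BAKE_SECONDS = 3600                      # let monitoring watch each wave
|
|     def deploy_to(region, version): ...      # hypothetical deploy hook
|     def rollback(region): ...                # hypothetical rollback hook
|     def healthy(region): return True         # hypothetical monitoring check
|
|     def rollout(version):
|         deployed = []
|         for wave in WAVES:                   # a few regions at a time
|             for region in wave:
|                 deploy_to(region, version)
|                 deployed.append(region)
|             time.sleep(BAKE_SECONDS)         # bake before the next wave
|             if not all(healthy(r) for r in deployed):
|                 for r in reversed(deployed):
|                     rollback(r)              # automatic rollback on regression
|                 return False
|         return True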
| rybosome wrote:
| Global dependencies were disallowed back in 2018 with a tiny
| handful of exceptions that were difficult or impossible to
| make fully regional. Chemist, the service that went down, was
| one of those.
|
| Generally GCP wants regionality, but because it offers so
| many higher-level inter-region features, some kind of a
| global layer is basically inevitable.
| dangoodmanUT wrote:
| Does Route53 depend on services in us-east-1 though? Or maybe
| it's something else, but I recall us-east-1 downtime causing
| service downtime for global services.
| cyberax wrote:
| As far as I remember, Route53 is semi-regional. The master
| copy is kept in us-east-1, but individual regions have
| replicated data. So if us-east-1 goes down, the individual
| regions will keep working with the last known state.
|
| Amazon calls this "static stability".
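|
| A toy illustration of that pattern (not Route53's actual
| implementation; fetch_from_primary is a made-up stand-in for
| replication out of the primary region):
|
|     def fetch_from_primary():
|         # hypothetical replication pull from the primary region
|         raise ConnectionError("primary region unreachable")
|
|     class RegionalResolver:
|         def __init__(self):
|             self.last_known_records = {}     # locally replicated copy
|
|         def refresh(self):
|             try:
|                 self.last_known_records = fetch_from_primary()
|             except ConnectionError:
|                 pass                         # primary down: keep last known state
|
|         def resolve(self, name):
|             return self.last_known_records.get(name)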
| toast0 wrote:
| Static stability is a good start, but isn't enough.
|
| In this outage, my service (on GCP) had static stability,
| which was great. However, some other similar services
| failed, and we got more load, but we couldn't start
| additional instances to handle the load because of the
| outage, and so we had overloaded servers and poor service
| quality.
|
| Mayhaps we could have adjusted load across regions to
| manage instance load, but that's not something we
| normally do.
| flaminHotSpeedo wrote:
| AWS regions are fundamentally different from GCP regions. GCP
| marketing tries really hard to make it seem otherwise, or
| that GCP has all the advantages of AWS regions plus the
| advantages of their approach, which leans heavily on
| "effectively global" services. There are tradeoffs, for
| example multi-region in GCP is often trivial and GCP can
| enforce fairness across regions, but that comes at the cost
| of availability. Which would be fine - GCP SLAs reflect the
| fact that they rarely consider regions to be reliable fault
| containers, but GCP marketing, IMO, creates a dangerous
| situation by pretending to be something they aren't.
|
| Even in the mini incident report they were going through
| extreme linguistic gymnastics trying to claim they are
| regional. Describing the service that caused the outage,
| which is responsible for _global quota enforcement_ and is
| configured using a data store that replicates data globally
| in near real time, with apparently no option to delay
| replication, they said: "Service Control is a
| regional service that has a regional datastore that it reads
| quota and policy information from. This datastore metadata
| gets replicated almost instantly globally to manage quota
| policies for Google Cloud and our customers."
|
| Not only would AWS call this a global service, the whole
| concept of global quotas would not fly at AWS.
| buremba wrote:
| I think making the identity piece regional hurts the UX a lot.
| I like GCP's approach, where you manage multiple regions with a
| single identity, but I'm not sure how they can make it
| resilient to regional failures.
| nijave wrote:
| Async replication? I think you could run semi-independent
| regions with an orchestrator that copies config to each one.
| You'd go into a degraded read-only state but it wouldn't be
| hard down.
|
| Of course bugs in the orchestrator could cause outages but
| ideally that piece is a pretty simple "loop over regions and
| call each regional API update method with the same arguments".
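|
| A minimal sketch of that loop (illustrative only; regional_clients
| and update_config are hypothetical names, not a real SDK):
|
|     def push_config(regional_clients, desired_config):
|         # Push the same desired config to each region's own API.
|         # One unreachable region degrades only that region; the
|         # others stay untouched (degraded, not hard down).
|         failures = {}
|         for region, client in regional_clients.items():
|             try:
|                 client.update_config(desired_config)
|             except Exception as exc:
|                 failures[region] = exc
|         return failures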
| delusional wrote:
| > they were tired of big outages years ago
|
| One could hope that they'd realize whatever red tape they've
| been putting up so far hasn't helped, and so more of it
| probably won't either.
|
| If what you're doing isn't having an effect you need to do
| something different, not just more.
| kubb wrote:
| They'll do more of the same. The leads are clueless and
| sensible voices of criticism are deftly squashed.
| raverbashing wrote:
| Lol I love how they call "not spreading your services needlessly
| across many different servers" an "Architectural Pattern"
| (cell-based arch).
|
| They are right, of course, but the way things are, the obvious
| needs to be said sometimes.
| macintux wrote:
| Years ago I had the misfortune of helping a company recover
| from an outage.
|
| It turned out that they had services in two data centers for
| redundancy, but they _divided their critical services between
| them_.
|
| So if either data center went offline, their whole stack was
| dead. Brilliant design. That was a very long week; fortunately
| by now I've forgotten most of it.
| Peterpanzeri wrote:
| "We got lucky as the way we designed it happened not to use the
| part of the service that was degraded" - this is a stupid
| statement from them; hope they will be prepared next time.
| mankyd wrote:
| Why is that stupid? They did get lucky. They are acknowledging
| that, had they used that, they would have had problems. And now
| they will work to be more prepared.
|
| Acknowledging that one still has risks and that luck plays a
| factor is important.
| beefnugs wrote:
| I learned a lesson: "use less cloud"
| zzyzxd wrote:
| The article is unnecessarily long only to brag about "a service
| we didn't use went down so it didn't affect us". If I want to be
| picky, their architecture is also not perfect:
|
| - Their alerts were not durable. The outage took out the alert
| system so humans were just eyeballing dashboards during the
| outage. What if your critical system went down along with that
| alert system, in the middle of the night?
|
| - The cloud marketplace service was affected by the Cloudflare
| outage and there's nothing they could do.
|
| - Tiered storage was down and disk usage went above normal levels,
| but there was no anomaly detection and no alerts (a minimal
| threshold check is sketched after this list). It survived because
| t0 storage was massively over-provisioned.
|
| - They took pride in using well-known industry designs like cell-
| based architecture, redundancy, multi-AZ... ChatGPT would be able
| to give me a better list.
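|
| For the disk-usage point above, a minimal sketch of the kind of
| threshold check that was missing (the path, threshold, and
| notify() hook are hypothetical):
|
|     import shutil
|
|     def notify(message):
|         print("ALERT:", message)             # stand-in for a real pager
|
|     def check_disk(path="/var/lib/redpanda/data", threshold=0.80):
|         usage = shutil.disk_usage(path)
|         fraction_used = usage.used / usage.total
|         if fraction_used > threshold:
|             notify(f"disk usage at {fraction_used:.0%} on {path}")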
|
| And I don't get why they had to roast Crowdstrike at the end. I
| mean, the Crowdstrike incident was really amateur stuff, like,
| the absolute lowest bar I can think of.
| diroussel wrote:
| > Modern computer systems are complex systems -- and complex
| systems are characterized by their non-linear nature, which means
| that observed changes in an output are not proportional to the
| change in the input. This concept is also known in chaos theory
| as the butterfly effect,
|
| This isn't quite right. Linear systems can also be complex, and
| linear dynamic systems can also exhibit the butterfly effect.
|
| That is why the butterfly effect is so interesting.
|
| Of course non-linear systems can have a large change in output
| based on a small input, because they allow step changes and many
| other non-linear behaviours.
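|
| A one-line worked example of that point (illustrative): for the
| linear system $\dot{x} = \lambda x$ with $\lambda > 0$, two
| trajectories starting a distance $\varepsilon$ apart separate as
| $|x_1(t) - x_2(t)| = \varepsilon e^{\lambda t}$, so an arbitrarily
| small change in initial conditions grows exponentially over time
| even though the system is perfectly linear.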
___________________________________________________________________
(page generated 2025-06-21 23:01 UTC)