[HN Gopher] Behind the scenes: Redpanda Cloud's response to the ...
       ___________________________________________________________________
        
       Behind the scenes: Redpanda Cloud's response to the GCP outage
        
       Author : eatonphil
       Score  : 74 points
       Date   : 2025-06-21 14:57 UTC (8 hours ago)
        
 (HTM) web link (www.redpanda.com)
 (TXT) w3m dump (www.redpanda.com)
        
       | RadiozRadioz wrote:
       | Hmm. Here's what I read from this article: RedPanda didn't happen
       | to use any of the stuff in GCP that went down, so they were
       | unaffected. They use a 3rd party for alerting and dashboarding,
       | and that 3rd party went down, but RedPanda still had their own
       | monitoring.
       | 
       | When I read "major outage for a large part of the internet was
       | just another normal day for Redpanda Cloud customers", I expected
       | a brave tale of RedPanda SREs valiantly fixing things, or some
       | cool automatic failover tech. What I got instead was: Google told
       | RedPanda there was an issue, RedPanda had a look and their
       | service was unaffected, nothing needed failing over, then someone
       | at RedPanda wrote an article bragging about their triple-nine
       | uptime & fault tolerance.
       | 
       | I get it, an SRE is doing well if you don't notice them, but the
       | only real preventative measure I saw here that directly helped
       | with this issue is that they over-provision disk space. Which
       | I'd be alarmed if they didn't do.
        
         | literallyroy wrote:
         | Yeah, I thought they were going to show something cool like
         | multi-tenant architecture. Odd to write this article when they
         | clearly expected to be impacted, given that they were reaching
         | out to customers.
        
           | dangoodmanUT wrote:
           | I think you're missing the point. What I took away was:
           | "Because we design for zero dependencies for full operation,
           | we didn't go down". Their extra features like tiered storage
           | and monitoring going down didn't affect normal operations,
           | whereas it seems it did for similar solutions with similar
           | features.
        
         | echelon wrote:
         | > triple-nine uptime & fault tolerance.
         | 
         | Haha, we used to joke that's how many nines our customer-facing
         | Ruby on Rails services had compared against our resilient five
         | nines payments systems. Our heavy infra handled billions in
         | daily payment volume and couldn't go down.
         | 
         | With the Ruby teams, we'd playfully quip, "which nines are
         | those?", implying the leading digit wasn't itself a nine.
        
           | sokoloff wrote:
           | AKA: "We're closing in on our third 8 of uptime..."
        
         | gopher_space wrote:
         | > I expected a brave tale of RedPanda SREs valiantly fixing
         | things, or some cool automatic failover tech.
         | 
         | It's a tale of how they set things up so they wouldn't need to
         | valiantly fix things, and I think the subtext is probably that
         | Redpanda doesn't pass responsibility on to a third party.
         | 
         | There are plenty of domains and, more importantly, people who
         | need uptime guarantees to mean "fix estimate from a real human
         | working on the problem" and not eventual store credit. Payroll
         | is a classic example.
        
           | RadiozRadioz wrote:
           | Nothing about the way they architected their system even
           | mattered in this incident. Their service just wasn't using
           | any of the infrastructure that failed - there was no event
           | here that actually put their system design to the test. There
           | just isn't a story here.
           | 
           | It's like if the power went out in the building next door,
           | and you wrote a blog post about how amazing the reliability
           | of your office computers is compared to your neighbor's. If
           | your power had gone out too but you had provisioned a bunch
           | of UPSs and been fine, then there's something to talk about.
           | 
           | To extend the analogy, if the neighborhood had a reputation
           | for brown-outs and you deliberately chose not to build an
           | office there, then maybe you have something. But here,
           | RedPanda's GCP offering runs inside GCP; a failure like this
           | had never happened before, and they just got lucky.
        
       | bdavbdav wrote:
       | "We got lucky as the way we designed it happened not to use the
       | part of the service that was degraded"
        
         | smoyer wrote:
         | And we're oblivious enough about that luck that we're patting
         | ourselves on the back in public.
        
           | belter wrote:
           | And we are linking our blog to the AWS doc on cell
           | architectures, while talking about multiaz clusters on GCP
           | azs that are nothing like that...
        
       | rybosome wrote:
       | Must be hell inside GCP right now. That was a big outage, and
       | they were tired of big outages years ago. It was already
       | extremely difficult to move quickly and get things done due to
       | the reliability red tape, and I have to imagine this will make it
       | even harder.
        
         | siscia wrote:
         | In fairness, their design does not seem to be regional, with
         | problems in one region bringing down other, apparently
         | unrelated, regions.
         | 
         | With this kind of architecture, this sort of problem is just
         | bound to happen.
         | 
         | During my time at AWS, region independence was a must, and
         | some services were able to operate, at least for a while,
         | without degrading even when some core dependencies were
         | unavailable. Think losing S3.
         | 
         | And after that, the service would keep operating, but with a
         | degraded experience.
         | 
         | I am stunned that this level of isolation is not common in GCP.
        
           | valenterry wrote:
           | How does AWS do that though? Do they re-implement all the
           | code in every region? Because even the slightest re-use of
           | code could trigger a simultaneous (possibly delayed)
           | downtime of all regions.
        
             | crop_rotation wrote:
             | Reusing code doesn't trigger region dependencies.
             | 
             | > Do the re-implement all the code in every region?
             | 
             | Everyone does.
             | 
             | The difference is that AWS very strongly ensures that
             | regions are independent failure domains. The GCP
             | architecture is global, with all the pros and cons that
             | implies: e.g. GCP has a truly global load balancer, while
             | AWS cannot, since everything is regional at its core.
        
             | nijave wrote:
             | They definitely roll out code (at least for some services)
             | one region at a time. That doesn't prevent old bugs/issues
             | from coming up but it definitely helps prevent new ones
             | from becoming global outages.
        
             | cyberax wrote:
             | Regions (and even availability zones) in AWS are
             | independent. The regions all have overlapping IPv4
             | addresses, so direct cross-region connectivity is
             | impossible.
             | 
             | So it's actually really hard to accidentally make cross-
             | region calls, if you're working inside the AWS
             | infrastructure. The call has to happen over the public
             | Internet, and you need a special approval for that.
             | 
             | Deployments also happen gradually, typically only a few
             | regions at a time. There's an internal tool that allows
             | things to be gradually rolled out and automatically rolled
             | back if monitoring detects that something is off.
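             | 
             | A rough sketch of that gradual rollout with automatic
             | rollback (all names here are hypothetical, not AWS's
             | actual internal tooling):

```python
# Wave-based rollout sketch: deploy to a few regions at a time, and
# halt and roll everything back if health checks regress in a freshly
# deployed wave. Function names are illustrative assumptions.
def deploy_in_waves(regions, deploy, health_ok, rollback, wave_size=2):
    deployed = []
    for i in range(0, len(regions), wave_size):
        wave = regions[i:i + wave_size]
        for region in wave:
            deploy(region)
            deployed.append(region)
        if not all(health_ok(r) for r in wave):
            # Monitoring flagged a regression: undo in reverse order.
            for region in reversed(deployed):
                rollback(region)
            return False  # rollout aborted
    return True  # every region deployed cleanly
```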
        
           | rybosome wrote:
           | Global dependencies were disallowed back in 2018 with a tiny
           | handful of exceptions that were difficult or impossible to
           | make fully regional. Chemist, the service that went down, was
           | one of those.
           | 
           | Generally GCP wants regionality, but because it offers so
           | many higher-level inter-region features, some kind of a
           | global layer is basically inevitable.
        
           | dangoodmanUT wrote:
           | Does Route53 depend on services in us-east-1 though? Or
           | maybe it's something else, but I recall us-east-1 downtime
           | causing service downtime for global services.
        
             | cyberax wrote:
             | As far as I remember, Route53 is semi-regional. The master
             | copy is kept in us-east-1, but individual regions have
             | replicated data. So if us-east-1 goes down, the individual
             | regions will keep working with the last known state.
             | 
             | Amazon calls this "static stability".
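             | 
             | A minimal sketch of that static-stability pattern
             | (hypothetical names, not actual AWS internals): the data
             | plane keeps answering from the last state it replicated,
             | even while the master region is unreachable.

```python
# Static-stability sketch: a regional replica refreshes from the
# master region when it can, and keeps serving its last-known-good
# snapshot when the master is down.
class RegionalReplica:
    def __init__(self):
        self._last_known = None  # last successfully replicated state

    def sync(self, fetch_master_state):
        try:
            self._last_known = fetch_master_state()
        except ConnectionError:
            pass  # master unreachable: keep the stale-but-valid snapshot

    def resolve(self, name):
        if self._last_known is None:
            raise RuntimeError("no state replicated yet")
        return self._last_known[name]
```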
        
               | toast0 wrote:
               | Static stability is a good start, but isn't enough.
               | 
               | In this outage, my service (on GCP) had static stability,
               | which was great. However, some other similar services
               | failed, and we got more load, but we couldn't start
               | additional instances to handle the load because of the
               | outage, and so we had overloaded servers and poor service
               | quality.
               | 
               | Mayhaps we could have adjusted load across regions to
               | manage instance load, but that's not something we
               | normally do.
        
           | flaminHotSpeedo wrote:
           | AWS regions are fundamentally different from GCP regions.
           | GCP marketing tries really hard to make it seem otherwise,
           | or that GCP has all the advantages of AWS regions plus the
           | advantages of their approach, which means leaning heavily
           | on "effectively global" services. There are tradeoffs: for
           | example, multi-region in GCP is often trivial, and GCP can
           | enforce fairness across regions, but that comes at the cost
           | of availability. Which would be fine - GCP's SLAs reflect
           | the fact that they rarely consider regions to be reliable
           | fault containers - but GCP marketing, IMO, creates a
           | dangerous situation by pretending to be something they
           | aren't.
           | 
           | Even in the mini incident report they were going through
           | extreme linguistic gymnastics trying to claim they are
           | regional. Describing the service that caused the outage,
           | which is responsible for _global quota enforcement_ and is
           | configured using a data store that replicates data globally
           | in near real time, with apparently no option to delay
           | replication, they said:
           | 
           |       Service Control is a regional service that has a
           |       regional datastore that it reads quota and policy
           |       information from. This datastore metadata gets
           |       replicated almost instantly globally to manage quota
           |       policies for Google Cloud and our customers.
           | 
           | Not only would AWS call this a global service, the whole
           | concept of global quotas would not fly at AWS.
        
         | buremba wrote:
         | I think making the identity piece regional hurts the UX a lot.
         | I like GCP's approach, where you manage multiple regions with a
         | single identity, but I'm not sure how they can make it
         | resilient to regional failures.
        
           | nijave wrote:
           | Async replication? I think you could run semi independent
           | regions with an orchestrator that copies config to each one.
           | You'd go into a degraded read-only state, but it wouldn't be
           | hard down.
           | 
           | Of course bugs in the orchestrator could cause outages but
           | ideally that piece is a pretty simple "loop over regions and
           | call each regional API update method with the same arguments"
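           | 
           | Something like this sketch (names are made up):

```python
# Orchestrator sketch: push the same config change to every independent
# regional API, collecting per-region failures instead of failing the
# whole fan-out. A region that errors out keeps its previous config.
def propagate_config(regional_apis, change):
    failed = []
    for region, apply_update in regional_apis.items():
        try:
            apply_update(change)
        except Exception:
            failed.append(region)
    return failed
```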
        
         | delusional wrote:
         | > they were tired of big outages years ago
         | 
         | One could hope that they'd realize whatever red tape they've
         | been putting up so far hasn't helped, and so more of it
         | probably won't either.
         | 
         | If what you're doing isn't having an effect you need to do
         | something different, not just more.
        
           | kubb wrote:
           | They'll do more of the same. The leads are clueless and
           | sensible voices of criticism are deftly squashed.
        
       | raverbashing wrote:
       | Lol, I love how they call "not spreading your services
       | needlessly across many different servers" an "architectural
       | pattern" (cell-based architecture).
       | 
       | They are right, of course, but that's the way things are: the
       | obvious needs to be said sometimes.
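       | 
       | The core mechanic really is about that simple - a sketch
       | (illustrative names only) of pinning each customer to one cell,
       | so a cell failure only touches that slice of customers:

```python
import hashlib

# Cell-based placement sketch: a stable hash pins each customer to
# exactly one cell, bounding the blast radius of a cell failure.
def cell_for(customer_id, cells):
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return cells[int(digest, 16) % len(cells)]
```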
        
         | macintux wrote:
         | Years ago I had the misfortune of helping a company recover
         | from an outage.
         | 
         | It turned out that they had services in two data centers for
         | redundancy, but they _divided their critical services between
         | them_.
         | 
         | So if either data center went offline, their whole stack was
         | dead. Brilliant design. That was a very long week; fortunately
         | by now I've forgotten most of it.
        
       | Peterpanzeri wrote:
       | "We got lucky as the way we designed it happened not to use the
       | part of the service that was degraded" this is a stupid statement
       | from them, hope they will be prepared next time
        
         | mankyd wrote:
         | Why is that stupid? They did get lucky. They are acknowledging
         | that, had they used that, they would have had problems. And now
         | they will work to be more prepared.
         | 
         | Acknowledging that one still has risks and that luck plays a
         | factor is important.
        
         | beefnugs wrote:
         | I learned a lesson: "use less cloud"
        
       | zzyzxd wrote:
       | The article is unnecessarily long only to brag about "a service
       | we didn't use went down so it didn't affect us". If I want to be
       | picky, their architecture is also not perfect:
       | 
       | - Their alerts were not durable. The outage took out the alert
       | system so humans were just eyeballing dashboards during the
       | outage. What if your critical system went down along with that
       | alert system, in the middle of the night?
       | 
       | - The cloud marketplace service was affected by the Cloudflare
       | outage and there's nothing they could do about it.
       | 
       | - Tiered storage was down and disk usage went above normal
       | levels, but there was no anomaly detection and no alerting. It
       | survived because t0 storage was massively over-provisioned.
       | 
       | - They took pride in using industry well-known designs like
       | cell-based architecture, redundancy, multi-AZ... ChatGPT would
       | be able to give me a better list.
       | 
       | And I don't get why they had to roast CrowdStrike at the end. I
       | mean, the CrowdStrike incident was really amateur stuff, like,
       | the absolute lowest bar I can think of.
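       | 
       | The non-durable alerts point is the classic dead man's switch
       | problem; a sketch (hypothetical names and thresholds) of the
       | out-of-band watchdog that would cover it:

```python
# Dead-man's-switch sketch: an independent watchdog pages a human if
# the primary alerting pipeline stops emitting heartbeats.
def check_watchdog(last_heartbeat, now, page, max_silence=120):
    silence = now - last_heartbeat
    if silence > max_silence:
        page("alerting pipeline silent for %ds" % silence)
        return False
    return True
```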
        
       | diroussel wrote:
       | > Modern computer systems are complex systems -- and complex
       | systems are characterized by their non-linear nature, which means
       | that observed changes in an output are not proportional to the
       | change in the input. This concept is also known in chaos theory
       | as the butterfly effect,
       | 
       | This isn't quite right. Linear systems can also be complex, and
       | linear dynamic systems can also exhibit the butterfly effect.
       | 
       | That is why the butterfly effect is so interesting.
       | 
       | Of course non-linear systems can have a large change in output
       | from a small change in input, because they allow step changes
       | and many other non-linear behaviors.
        
       ___________________________________________________________________
       (page generated 2025-06-21 23:01 UTC)