[HN Gopher] We unplugged a data center to test our disaster readiness
___________________________________________________________________
We unplugged a data center to test our disaster readiness
Author : ianrahman
Score : 187 points
Date : 2022-04-25 14:47 UTC (1 day ago)
(HTM) web link (dropbox.tech)
(TXT) w3m dump (dropbox.tech)
| rkagerer wrote:
| This read to me as much like a history of technical debt as an
| article about current efforts.
| eptcyka wrote:
| Given that the blog doesn't load for me, I guess the datacenter
| remains unplugged?
| [deleted]
| deathanatos wrote:
| > _Not Found_
|
| > _The requested URL /infrastructure/disaster-readiness-test-
| failover-blackhole-sjc was not found on this server._
|
| > _Additionally, a 404 Not Found error was encountered while
| trying to use an ErrorDocument to handle the request._
|
| I see which data center was hosting the article, then.
|
| Internet Archive to the rescue:
| https://web.archive.org/web/20220426191128/https://dropbox.t...
| lloydatkinson wrote:
| I admire the forward thinking and application of common sense
| here. There's literally no better way of testing the system than
| this. It seems that a lot of big tech companies would never have
| the balls to do this themselves.
| mike_d wrote:
| Netflix has Chaos Gorilla, Facebook has Storms(?), Google has
| DiRT. Everyone does this type of testing.
| aaaaaaaaata wrote:
| ckwalsh wrote:
| Similar to Facebook's Storm initiative:
| https://www.forbes.com/sites/roberthof/2016/09/11/interview-...
|
| These exercises happen several times a year.
| Shish2k wrote:
| I used to be heavily involved in those. It was a process where
| we'd take a week to prepare, do weeks of post-mortems, and
| print a run of t-shirts for everyone involved to celebrate
| pulling one off successfully.
|
| These days the team running them announces that it's happening
| in an opt-in announcement group at 8am, pulls the plug at 9am,
| and barely anyone even notices because the automation handles
| it so gracefully.
|
| Mostly I just miss the t-shirts, as the <datacenter>-storm
| events got the coolest graphical designs...
| bob1029 wrote:
| > This complex ownership model made it hard to move just a subset
| of user data to another region.
|
| Welcome to basically any large-scale enterprise.
|
| I have come to learn that the active-passive strategy is the
| best option if the business can tolerate the necessary human
| delays. You get to side-step so much complicated bullshit with
| this path.
|
| Trying to force active-active instant failover magic is sometimes
| not possible, profitable or even desirable. I can come up with a
| few scenarios where I would absolutely _insist_ that a human go
| and check a few control points before letting another exact copy
| of the same system start up on its own, even if it would be
| possible in theory to have automatic failover work reliably
| 99.999% of the time.
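|
| Concretely, the kind of gate I mean is just a promotion script
| that refuses to run until a human has ticked the boxes. Rough
| sketch below; every name in it is made up:
|
|     # Toy failover gate: refuse to promote the passive copy until an
|     # operator has explicitly confirmed each control point. Every
|     # name here is invented for illustration.
|     import sys
|
|     CHECKLIST = [
|         "Primary is actually down, not just a monitoring blip",
|         "Last replication checkpoint applied cleanly on the standby",
|         "No batch jobs were mid-flight against the primary",
|     ]
|
|     def confirmed(prompt: str) -> bool:
|         """Ask the operator to acknowledge one control point."""
|         return input(f"{prompt} [yes/no]: ").strip().lower() == "yes"
|
|     def promote_standby() -> None:
|         """Placeholder for whatever actually makes the standby active."""
|         print("Promoting standby to primary...")
|
|     if __name__ == "__main__":
|         for item in CHECKLIST:
|             if not confirmed(item):
|                 sys.exit("Aborting failover: checklist not satisfied.")
|         promote_standby()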
| dboreham wrote:
| My fear with any sort of passive standby approach is that when
| the disaster comes, that standby won't work, or the mechanism
| used to fail over to it won't work. I prefer schemes where the
| "failover" is happening all the time hence I can be confident
| it works.
| mike_d wrote:
| A solid active/standby design should regularly flip. Every 2
| weeks seems to be a sweet spot. This also balances wear
| across consumable hardware like disks.
|
| If your failover is happening "all the time" you basically
| just have a single system with failures.
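|
| In practice that just means a scheduled job that swaps the roles
| on a fixed cadence. Toy sketch; the region names and the
| promote/demote steps are invented:
|
|     # Toy sketch of a scheduled active/standby swap. The only point
|     # is the fixed cadence; everything else is a placeholder.
|     import datetime
|
|     FLIP_INTERVAL = datetime.timedelta(weeks=2)
|
|     def due_for_flip(last_flip, now):
|         """True once the standby has sat idle for a full interval."""
|         return now - last_flip >= FLIP_INTERVAL
|
|     def flip_roles(active, standby):
|         """Demote the active site, promote the standby (placeholder)."""
|         print(f"Demoting {active}, promoting {standby}")
|         return standby, active
|
|     if __name__ == "__main__":
|         active, standby = "region-a", "region-b"
|         last_flip = datetime.datetime(2022, 4, 11,
|                                       tzinfo=datetime.timezone.utc)
|         now = datetime.datetime.now(datetime.timezone.utc)
|         if due_for_flip(last_flip, now):
|             active, standby = flip_roles(active, standby)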
| orev wrote:
| Having a passive strategy doesn't mean you don't test it, and
| you can even perform actual failovers once in a while to
| validate everything.
|
| Active-active is also valid, but the point is that it comes
| with a huge amount of increased complexity. At some point you
| need to make a value calculation to decide if you want to
| focus on that, or on building the product.
| [deleted]
| aftbit wrote:
| How big is metaserver these days?
|
| I might run three deployments at each data center: the primary,
| and secondaries for two other regions. Replicate between them at
| the block device level, bypassing the mysql replication situation
| entirely (except for on-disk consistency requirements of course).
|
| Of course this comes with a 3x increase in service infrastructure
| costs because of the two backups in each data center that are
| idle waiting for load.
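|
| To spell out the layout (region names invented, purely
| illustrative):
|
|     # Each region hosts its own primary plus a secondary for every
|     # other region, which is where the ~3x footprint comes from.
|     regions = ["region-a", "region-b", "region-c"]
|
|     placement = {
|         region: [f"{region}-primary"]
|         + [f"{other}-secondary" for other in regions if other != region]
|         for region in regions
|     }
|
|     for region, deployments in placement.items():
|         print(region, deployments)
|     # Every region runs len(regions) deployments; two of the three
|     # sit idle until a failover pulls them into service.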
| Johnny555 wrote:
| And much higher write latency, assuming write consistency is
| important to you.
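|
| Back-of-the-envelope, with made-up latency numbers:
|
|     # A write isn't durable until the remote replicas acknowledge
|     # it, so the cross-region round trip lands on every commit.
|     # All numbers below are invented for illustration.
|     local_commit_ms = 2     # hypothetical local fsync/commit
|     rtt_region_b_ms = 40    # hypothetical round trip to region B
|     rtt_region_c_ms = 70    # hypothetical round trip to region C
|
|     # Fully synchronous: wait for the slowest replica.
|     sync_ms = local_commit_ms + max(rtt_region_b_ms, rtt_region_c_ms)
|
|     # Quorum (2 of 3): only the faster remote ack matters.
|     quorum_ms = local_commit_ms + min(rtt_region_b_ms, rtt_region_c_ms)
|
|     print(f"sync commit ~{sync_ms} ms, quorum commit ~{quorum_ms} ms")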
| outside1234 wrote:
| DiRT
| zoover2020 wrote:
| Take that, ChaosMonkey! This is King Kong
| qw3rty01 wrote:
| Chaos Gorilla has already existed for over a decade
| beckman466 wrote:
| Gotta prepare for the climate crisis. As Douglas Rushkoff
| discovered, rich capitalists "are plotting to leave us behind"
|
| https://onezero.medium.com/survival-of-the-richest-9ef6cddd0...
|
| unpaywalled: https://archive.ph/AABsP
| aftbit wrote:
| >Given San Jose's proximity to the San Andreas Fault, it was
| critical we ensured an earthquake wouldn't take Dropbox offline.
|
| >Given this context, we structured our RTO--and more broadly, our
| disaster readiness plans--around imminent failures where our
| primary region is still up, but may not be for long.
|
| IMO if the big one hits San Andreas, the SJC facilities will
| likely go down with ~0 warning. Certainly not enough time to
| drain user traffic.
|
| It's interesting to note that Dropbox realistically can probably
| tolerate a loss of a few seconds to minutes of user data in a
| major earthquake, but cannot tolerate the same losses to perform
| realistic tests (just yank the cable, no warning).
|
| If the earthquake hits at 3am in SF, it'll likely take both the
| metro and a significant number of the DR team out of the picture
| for some time. Surviving that kind of blow in the
| short term with 0 downtime is a very hard goal.
| paxys wrote:
| A random (large) earthquake along the San Andreas fault is not
| the same as all of western USA ripping apart. A much more
| likely scenario is that the power grid goes down and the data
| center stays reasonably intact for a while on emergency power.
| [deleted]
| dmitrygr wrote:
| Google does this yearly, with different scenarios - it is called
| DiRT and you can read a little here:
| https://cloud.google.com/blog/products/management-tools/shri...
___________________________________________________________________
(page generated 2022-04-26 23:00 UTC)