[HN Gopher] Lessons Learned from Twenty Years of Site Reliabilit...
___________________________________________________________________
Lessons Learned from Twenty Years of Site Reliability Engineering
Author : maheshrjl
Score : 291 points
Date : 2023-10-27 11:25 UTC (11 hours ago)
(HTM) web link (sre.google)
(TXT) w3m dump (sre.google)
| 6LLvveMx2koXfwn wrote:
| "We once narrowly missed a major outage because the engineer who
| submitted the would-be-triggering change unplugged their desktop
| computer before the change could propagate". Sorry, what?
| jrms wrote:
| I thought the same.
| jedberg wrote:
| The change was being orchestrated from their desktop, and they
| noticed things were going sideways, so they unplugged their
| desktop to stop the deployment. Aka "pressed the big red
| button".
| francisofascii wrote:
| Yeah, interesting tidbit. It might sound insane today that one
| engineer's desktop computer could cause such an outage. But
| that was probably more commonplace 20 years ago and even today
| in smaller orgs.
| shadowgovt wrote:
| There was a famous incident at one point where code search
| internally went down. It turned out that while they had
| deployed the tool internally, one piece of the indexing
| process was still running as a cron job on the original
| developer's desktop machine. He went on vacation, his
| credentials aged out, and the crawler stopped refreshing the
| index.
|
| But my favorite incident will forever be the time they had to
| drill out a safe because they were disaster-testing the
| password vault system and discovered that the key needed to
| restore the password vault system was stored in a safe, the
| combination for which had been moved into the password vault
| system. Only with advanced, modern technology can you lock
| the keys to the safe in the safe itself with so many steps!
| tomcam wrote:
| > But my favorite incident will forever be the time they
| had to drill out a safe because they were disaster-testing
| the password vault system
|
| Great story to be sure---but I'm going to call it a
| success. They did the end-to-end testing and caught it then
| instead of in real life.
| shadowgovt wrote:
| Well, mostly-kinda-sorta. ;) It's the internal password
| vault, and there's only one of them, so it's more like
| "they broke it on purpose and then had to fix it before
| the company went off the rails." Among the things kept in
| that vault are credentials that, if they age out or aren't
| regularly refreshed, cause key internal infrastructure to
| start grinding to a halt.
|
| But still, "it broke while engineers were staring at it
| and trying to break it" is a better scenario than "it
| broke surprisingly while engineers were trying to do
| something else."
| dilyevsky wrote:
| At one point I had to run a script on a substantial portion of
| their server fleet (like hundreds of thousands of machines) and I
| remember I ran it with a pssh-style utility from desktop (was
| 10y ago so dunno if they still use this). It was surprisingly
| quick to do it this way. Could've been something like that.
| codemac wrote:
| It always cracks me up how Google is simultaneously the most
| web-based company in the world, but their internal political
| landscape was (Infra, Search, Ads) > everything else. This
| leads to infra swe writing stupid CLIs all day, rather than
| having any literal buttons. Things were changing a lot by the
| time I left though.
|
| I do think Google should be more open about their internal
| outages. This one in particular was very famous internally.
| dilyevsky wrote:
| We also avoided some outages by running one-off scripts
| fleet-wide, so it cuts both ways.
| throwawaaarrgh wrote:
| Every large enterprise is internally a tire fire.
| jeffbee wrote:
| It's a logical consequence of the "zero trust" network. If an
| engineer's workstation can make RPCs to production systems, and
| that engineer is properly entitled to assume some privileged
| role, then there's no difference between running the automation
| in prod and running it on your workstation. Even at huge
| scales, shell tools plus RPC client CLIs can contact every
| machine in the world pretty promptly.
| jedberg wrote:
| This is a great writeup, and very broadly applicable. I don't see
| any "this would only apply at Google" in here.
|
| > COMMUNICATION CHANNELS! AND BACKUP CHANNELS!! AND BACKUPS FOR
| THOSE BACKUP CHANNELS!!!
|
| Yes! At Netflix, when we picked vendors for systems that we used
| during an outage, we always had to make sure they were _not_ on
| AWS. At reddit we had a server in the office with a backup IRC
| server in case the main one we used was unavailable.
|
| And I think Google has a backup IRC server on AWS, but that
| might just be apocryphal.
|
| Always good to make sure you have a backup side channel that has
| as little to do with your infrastructure as possible.
| dilyevsky wrote:
| Afair Google just ran irc on their corp network which was
| completely separate from prod so I wouldn't be surprised if it
| was in a small server room in the office somewhere.
|
| > very broadly applicable. I don't see any "this would only
| apply at Google" in here.
|
| One thing I haven't even heard of anyone else doing was
| production panic rooms - a secure room with a backup VPN to
| prod.
| DaiPlusPlus wrote:
| > Afair Google just ran irc on their corp network which was
| completely separate from prod
|
| I thought Google didn't have a "corp" network because of
| their embrace of zero-trust in BeyondCorp?
| dilyevsky wrote:
| I don't think zero-trust prohibits network segmentation for
| redundancy or due to geographical constraints etc. It's
| mainly about how you gain access.
| Moto7451 wrote:
| Correct. At an old job we did zero trust corp on a
| different AWS region and account. The admin site was a
| different zero trust zone in prod region/account and was
| supposed to eventually become another AWS account in
| another region (for cost purposes).
|
| I can't say if any of this was ideal but it did work
| unobtrusively.
| shanemhansen wrote:
| They do. But I'd say most employees go their whole career
| without needing to do anything that requires a VPN.
|
| It's basically all web based access through what is, at the
| end of the day, an HTTP proxy.
|
| SREs need to be ready for stuff like "hey, what if the big
| proxy we all use to access internal resources is down?".
| jeffbee wrote:
| When I was there, the main IRC ran in prod. But it was
| intentionally a low-dependency system, an actual IRC server
| instead of something ridiculous like gIRC or IRC-over-stubby.
| dilyevsky wrote:
| i think it had a corp dns label but i'm fuzzy on that. yes
| it could've been a prod instance which would mean you'd
| need to go to panic room in case corp was down but maybe
| that was the intention.
| gaogao wrote:
| Also make sure your backup channel can scale. Getting a flood
| of 10,000+ folks over to a dinky IRC server can knock it over
| easy. Throttling new joiners isn't a panacea either, since
| someone critical might need to get onto a channel, which
| throttling can complicate.
| londons_explore wrote:
| Maybe I'm naive, but I would imagine that any raspberry pi
| could run an IRC server with 10,000 users...
|
| Surely 10,000 Users each on average receiving perhaps five 50
| byte messages per second (that's a very busy chat room!) is a
| total bandwidth of 2.5 megabytes per second.
|
| And the CPU time to shuffle messages from one TCP connection
| to another, or encrypt/decrypt 2.5 megabytes per second
| should be small. There is no complex parsing involved - it is
| literally just "forward this 50 byte message into this list
| of TCP connections".
|
| If they're all internal/authed users, you can hopefully
| assume nobody is deliberately DoSing the server with millions
| of super long messages into busy channels too.
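|
| Back-of-envelope in Python, using the same assumed numbers:
|
|   users = 10_000       # connected clients
|   msgs_per_user_s = 5  # messages each client receives per second
|   msg_bytes = 50       # average message size
|   total = users * msgs_per_user_s * msg_bytes
|   print(total / 1e6, "MB/s")  # -> 2.5 MB/s of outbound traffic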
| xena wrote:
| IRC servers are single threaded. You have contention at
| that point.
| strangattractor wrote:
| Brings back memories.
|
| I started getting phone calls at 5AM regarding servers going
| down. I immediately went to email to connect with other members
| of the team and found my inbox full of thousands of emails from
| the monitoring service notifying me of the outage. Those were
| the days :)
| fragmede wrote:
| The ultimate in "this would only apply at Google", was having
| to establish backup communication side channels that weren't
| Google dependent. While on call as an SRE for Google on the
| Internet traffic team, we would naturally default to Google
| Meet for communications during an incident, but the question
| is, what would we do if Google Meet is down? It's a critical-
| path team, which means that if I got paged because my team's
| system is down, it's not improbable that Google Meet (along
| with Google.com) was down because of it.
|
| We needed to have extra layers of backup communications
| because the first three backup layers were all also Google
| properties.
| That is to say, email wouldn't work because Gmail is Google,
| our work cell phones wouldn't work because they're Google Fi,
| and my home's ISP was Google Fiber/Webpass.
|
| All of which is to confirm that, yes, Google has a backup IRC
| server for communication. I won't say where, but it's
| explicitly totally off Google infrastructure for that very
| reason.
| jeffrallen wrote:
| Hurricane Electric? He, he, he.
| hinkley wrote:
| Before divestiture we had two groups splitting responsibility
| for IT, so our DevOps team (bleh) only had partial control of
| things. We were then running in heterogeneous mode - one on prem
| data center and cloud services.
|
| One day a production SAN went wonky, taking out our on prem
| data center. ... and also Atlassian with it. No Jira, no
| Confluence. Possibly no CI/CD as well. No carefully curated
| runbooks to recover. Just tribal knowledge.
|
| People. Were. Furious. And rightfully so. The 'IT' team lost
| control of a bunch of things in that little incident. Who puts
| customer facing and infrastructure eggs into the same basket
| like that?
| l9i wrote:
| > And I think Google has a backup IRC server on AWS, but that
| might just be apocryphal.
|
| They (>1) do exist but not on AWS or any other major cloud
| provider.
|
| (Or at least that was the case two years ago when I still
| worked there.)
| tiddo_langerak wrote:
| I'm curious how people approach big red buttons and intentional
| graceful degradation in practice, and especially how to ensure
| that these work when the system is experiencing problems.
|
| E.g. do you use db-based "feature flags"? What do you do then if
| the DB itself is overloaded, or the API through which you access
| the DB? Or do you use static startup flags (e.g. env variables)?
| How do you ensure these get rolled out quickly enough? Something
| else entirely?
| shadowgovt wrote:
| When you're a small company, simpler is actually better... It's
| better to keep things simple so that recovery is easy than to
| build out a more complicated solution that is reliable in the
| average case but fragile at the limits. Even if that means
| there are some places on the critical path where you don't use
| double redundancy, as a result the system is simple enough to
| fit in the heads of all the maintainers and can be rebooted or
| reverted easily.
|
| ... But once your firm starts making guarantees like "five
| nines uptime," there will be some complexity necessary to
| devise a system that can continue to be developed and improved
| while maintaining those guarantees.
| dilyevsky wrote:
| There's a chapter on client-side throttling in the sre book -
| https://sre.google/sre-book/handling-overload/
|
| At google we also had to routinely do "backend drains" of
| particular clusters when we deemed them unhealthy and they had
| a system to do that quickly at the api/lb layer. At other
| places I've also seen that done with application-level flags,
| so you'd do kubectl edit, which is obviously less than ideal
| but worked.
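|
| From memory, the adaptive throttling idea in the chapter linked
| above boils down to something like this (Python sketch; the
| class name and bookkeeping are mine, and K=2 is the value I
| remember the book suggesting, so treat the details as
| approximate):
|
|   import random
|
|   class AdaptiveThrottle:
|       """Client-side throttling, roughly as described in the
|       handling-overload chapter: once the backend stops
|       accepting, start rejecting locally before sending."""
|
|       def __init__(self, k: float = 2.0):
|           self.k = k          # accept/request gap to tolerate
|           self.requests = 0   # attempts (decay over a window IRL)
|           self.accepts = 0    # accepted by the backend
|
|       def allow(self) -> bool:
|           p_reject = max(0.0, (self.requests - self.k * self.accepts)
|                               / (self.requests + 1))
|           return random.random() >= p_reject
|
|       def record(self, accepted: bool) -> None:
|           self.requests += 1
|           if accepted:
|               self.accepts += 1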
| justapassenger wrote:
| Implementation details will depend on your stack, but there
| are 3 main things I'd keep in mind:
|
| 1. Keep it simple. No elaborate logic. No complex data stores.
| Just a simple check of the flag (a sketch follows below).
|
| 2. Do it as close to the source as possible, but have limited
| trust in your clients - you may have old versions, things not
| propagating, bugs, etc. So it's best to have the option to
| degrade both in the client and on the server. If you can do
| only one, do the server side.
|
| 3. Real-world test! And test often. Don't trust the test
| environment. Test on real-world traffic. Do periodic tests at
| small scale (like 0.1% of traffic) but also do full-scale
| tests on a schedule. If you didn't test it, it won't work when
| you need it. If it worked a year ago, it will likely not work
| now. If it's not tested, it'll likely cause more damage than
| it solves.
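|
| To make 1 and 2 concrete, the whole server-side check can be as
| dumb as this (Python; the path and feature name are made up -
| the real mechanism could be an env var or whatever your config
| management already pushes):
|
|   import os
|
|   KILL_DIR = "/etc/killswitches"   # hypothetical, pushed out
|                                    # by config management
|
|   def killed(feature: str) -> bool:
|       # The "flag" is just file-exists: nothing to parse, no DB
|       # or API in the way when things are already on fire.
|       return os.path.exists(os.path.join(KILL_DIR, feature))
|
|   # at the top of the expensive/optional code path:
|   #   if killed("recommendations"):
|   #       return cheap_fallback()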
| tomcam wrote:
| 4. Document procedures and methodically test the docs as well
| dineshkumar_cs wrote:
| Versioning, feature toggles and rollback - automated and
| implemented at different levels based on the system. It could
| be an env configuration, or a db field, or down migration
| scripts, or deploying the last working version.
| anonacct37 wrote:
| I worked at enough companies to see that many of them have some
| notion of "centralized config that's rolled to the edge and can
| be updated at runtime".
|
| I've done this with djb's CDB (constant database). But I've
| seen people poll an API for JSON config files or
| dbm/gdbm/Berkeleydb/leveldb.
|
| This can extend to other big red buttons. It's not that elegant
| but I've had numerous services that checked for the presence of
| a file to serve health checks. So pulling a node out of load
| balancer rotation was as easy as creating a file.
|
| The idea is that then, when there's a database outage, the
| system defaults to serving the last known good config.
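|
| For the JSON-polling flavor, the shape is roughly this (Python
| sketch; the URL and cache path are invented - swap in CDB or
| whatever you actually use):
|
|   import json, urllib.request
|
|   CONFIG_URL = "http://config.internal/flags.json"   # made up
|   CACHE_PATH = "/var/cache/myapp/flags.json"         # made up
|
|   def load_config() -> dict:
|       try:
|           with urllib.request.urlopen(CONFIG_URL, timeout=2) as r:
|               cfg = json.load(r)
|           with open(CACHE_PATH, "w") as f:
|               json.dump(cfg, f)   # refresh last known good
|           return cfg
|       except Exception:
|           # central store unreachable: serve last known good
|           with open(CACHE_PATH) as f:
|               return json.load(f)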
| masto wrote:
| It depends.
|
| To make up an example that doesn't depend on any of those
| things: imagine that I've added a new feature to Hacker News
| that allows users to display profile pictures next to their
| comments. Of course, we have built everything around
| microservices, so this is implemented by the frontend page
| generator making a call to the profile service, which does a
| lookup and responds with the image location. As part of the
| launch plan, I document the "big red button" procedure to
| follow if my new component is overloading the profile service
| or image repository: run this command to rate-limit my
| service's outgoing requests at the network layer (probably to 0
| in an emergency). It will fail its lookups and the page
| generator is designed to gracefully degrade by continuing to
| render the comment text, sans profile photo.
|
| (Before anyone hits send on that "what a stupid way to do X"
| reply, please note that this is not an actual design doc, I'm
| not giving advice on how to build anything, it's just a crayon
| drawing to illustrate a point)
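|
| In the same crayon spirit, the page generator's side of it is
| just a failure-tolerant call with a short timeout (the hostname
| and response shape below are invented):
|
|   import requests
|
|   def render_comment(user_id: str, text: str) -> dict:
|       out = {"text": text}
|       try:
|           r = requests.get(
|               f"http://profile-svc.internal/photo/{user_id}",
|               timeout=0.2,   # fail fast; the photo is optional
|           )
|           r.raise_for_status()
|           out["photo_url"] = r.json()["url"]
|       except requests.RequestException:
|           pass   # rate-limited to 0 or overloaded: serve the
|                  # comment text sans photo
|       return out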
| Dowwie wrote:
| If you're using AWS resources, give LocalStack a try for
| integration testing
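|
| The usual pattern is to point the SDK at LocalStack's edge port
| in your test setup - something like this with boto3 (4566 is
| the default port; the bucket name is arbitrary):
|
|   import boto3
|
|   s3 = boto3.client(
|       "s3",
|       endpoint_url="http://localhost:4566",  # LocalStack edge
|       region_name="us-east-1",
|       aws_access_key_id="test",       # dummy credentials;
|       aws_secret_access_key="test",   # LocalStack doesn't care
|   )
|   s3.create_bucket(Bucket="my-test-bucket")
|   names = [b["Name"] for b in s3.list_buckets()["Buckets"]]
|   assert "my-test-bucket" in names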
| whummer wrote:
| 100% - Including some of the Chaos Engineering features that
| are recently offered in the platform (e.g., simulating service
| errors/latencies, region outages, etc)
| galkk wrote:
| Wife used localstack at a previous job and it was a miserable
| experience. Especially their emulation of queues.
|
| Maybe things have improved since a couple of years ago, though.
| fensterblick wrote:
| Recently, I've heard of several companies folding up SRE and
| moving individuals to their SWE teams. Rumors are that LinkedIn,
| Adobe, and Robinhood have done this.
|
| This made me think: is SRE a byproduct of a bubble economy of
| easy money? Why not operate without the significant added expense
| of SRE teams?
|
| I wonder if SRE will be around 10 years from now, much like Sys
| Admins and QA testers have mostly disappeared. Instead, many of
| those functions are performed by software development teams.
| kccqzy wrote:
| SRE is not a byproduct of a bubble economy. I believe Google
| has had SREs since the very beginning. But still I think the
| rest of the point still stands. These days with devops the
| skill set needed for devs has indeed expanded to have
| significant overlap with SREs. I expect companies to downsize
| their SRE teams and distribute responsibilities to devs.
|
| A second major reason is automation. If you read the linked
| site long enough you'll find that in the early days of Google,
| SREs did plenty of manual work like deployments while manually
| watching graphs. They were indispensable then simply because
| even Google didn't have enough automation in their systems! You
| can read the story of Sisyphus
| https://www.usenix.org/sites/default/files/conference/protec...
| to kind of understand how Google's initial failure to adopt
| standardized automation ensured job security for SREs.
| davedx wrote:
| > "These days with devops the skill set needed for devs have
| indeed expanded to have significant overlap with SREs"
|
| Respectfully disagree on this. SRE is a huge complex realm
| unto itself. Just understanding how all the cloud components
| and environments and role systems work together is multiple
| training courses, let alone how to reliably deploy and run in
| them.
| derefr wrote:
| But modern approaches to dev require the SWEs to understand
| and model the operation of their software, and in fact
| program in terms of it -- "writing infrastructure" rather
| than just code.
|
| Lambda functions, for example: you have to understand their
| performance and scalability characteristics -- in turn
| requiring knowledge of things like the latency added by
| crossing the boundary between a managed shared service
| cluster and a VPC -- in order to understand how and where
| to factor things into individual deployable functions.
| dekhn wrote:
| Pedantically, Google didn't have SREs as the beginning. I
| asked a very early SRE, Lucas,
| (https://www.nytimes.com/2002/11/28/technology/postcards-
| from... and https://hackernoon.com/this-is-going-to-be-huge-
| google-found...), and he said that in the early days, outages
| would be really distracting to "the devs like Jeff and
| Sanjay" and he and a few others ended up forming SRE to
| handle site reliability more formally during the early days
| of growth, when Google got a reputation for being fast and
| scalable and nearly always up.
|
| Lucas helped make one of my favorite Google Historical
| Artefacts, a crayon chart of search volume. They had to
| continuously rescale the graph in powers of ten due to
| exponential growth.
|
| I miss pre-IPO Google and the Internet of that time.
| jeffrallen wrote:
| The idea of ops people who wrote code for deployment and
| monitoring and had responsibility for incident management and
| change control existed before Google gave it a name.
|
| Source: I was one at WebTV in 1996, and I worked with people
| who did it at Xerox PARC and General Magic long before then.
| vitalysh wrote:
| I guess now they have a team of software engineers, where part
| is focused on infra and part on backend. Sys Admins
| disappeared? They are DevOps/IT Engineers now. QA? SWE in Test,
| and so on.
| vsareto wrote:
| >Instead, many of those functions are performed by software
| development teams.
|
| And they likely won't be as good as dedicated SRE teams.
|
| But few businesses care about that right now considering
| layoffs.
|
| Throwing developers at a problem even if it isn't in their
| skill set is an industry trend that won't go away and will be
| more pronounced during downturns.
|
| Full stack developers are a great example of rolling two roles
| together without twice the pay.
| davedx wrote:
| Eh it's not the same thing. (I'm very full stack with
| intermittent devops/sre experience).
|
| Full stack means you write code running on back end and front
| end. 99% of the time the code you write on the FE interfaces
| with your other code for the BE. It's pretty coherent and
| feedback loops are similar.
|
| Devops/SRE on the other hand is very different and I agree we
| shouldn't expect software developers to be mixing SRE into
| their day to day. The skills, tools, mindset, feedback loop,
| and stress levels are too different.
|
| If you're not doing simple monoliths then you need a
| dedicated devops/SRE team.
| vsareto wrote:
| If you can be good at front and back end and keep up with
| both of them simultaneously, that's great, but:
|
| - you spend more time to keep up with both of those sectors
| compared to dedicated front or back end positions
|
| - you context switch more often than dedicated positions
|
| - you spent more time getting good at both of those things
|
| - you removed some amount of communication overhead if
| there were two positions
|
| You are definitely not being compensated for that extra
| work and benefit to the business given that full stack
| salaries are close to front end and back end position
| salaries.
| sarchertech wrote:
| I haven't noticed that in my corner of one of those mentioned
| companies. Also I'm not an SRE, but during the height of the
| recent tech layoffs the only job postings I was seeing were for
| SRE.
| gen220 wrote:
| I imagine the threshold is something like 1 SRE for every $1mm
| of high-margin revenue you can link to guaranteeing the 2nd "9"
| of $product availability/reliability.
| donalhunt wrote:
| I believe that is indeed a good guide for when it makes sense
| to have a SRE team supporting a service or product (with the
| caveat that the number probably isn't $1MM).
|
| There are also good patterns for ensuring you actually have
| adequate SRE coverage for what the business needs. 2 x 6ppl
| teams geographically dispersed doing 7x12 shifts works
| pretty well (not cheap). You can do it with less but you run
| into more challenges when individuals leave / get burnt out /
| etc.
| VirusNewbie wrote:
| > is SRE a byproduct of a bubble economy of easy money? Why not
| operate without the significant added expense of SRE teams?
|
| I'm a SWE SRE. I think in some cases it is better to be folded
| into a team. In other cases, less so.
|
| One SRE team can support _many_ different dev teams, and often
| the dev teams are not at all focusing time on the very
| complicated infra/distributed systems aspect of their job,
| it's just not something that they worry about day to day.
|
| So it makes sense to have an 'infra' team that operates at a
| different granularity than specialized dev teams.
|
| That may or may not need to be called SRE, or maybe it's an SRE
| SWE team, or maybe you just call it 'infrastructure' but at a
| certain scale you have more cross cutting concerns across teams
| where it's cheaper to split things out that way.
| xorcist wrote:
| Sys admins changed name to SREs, which changed name to devops
| engineers or cloud engineers or whatever the title is now.
|
| Still the same competency. Someone needs to know how those
| protocols work, tell you latency characteristics of storage,
| and read those core dumps.
| VirusNewbie wrote:
| In my G SRE interview, I had to do the same rigorous Software
| Engineering algorithms rounds as well as show deep
| distributed systems knowledge in designing highly available
| systems.
| tayo42 wrote:
| Two parts to this:
|
| Is SRE a bubble thing?
|
| I never got why SRE existed. (SRE has been my title...) The job
| responsibilities - caring about monitoring, logging,
| performance, and metrics of applications - are all things a
| qualified developer should be doing. Offloading caring about
| operating the software someone writes to someone else just
| seems illogical to me. Put the swes on call. If swes think the
| best way to do something is manual, have them do it themselves,
| then fire them for being terrible engineers. All these tedious
| interviews and a SWE doesn't know how the computer they are
| programming works? It's insane. All that schooling and things
| like how the OS works, which is part of an undergrad
| curriculum, gets offloaded to a career and title mostly made up
| of self-taught sysadmin people? Every good swe I've known knew
| how the OS, computer, and network work.
|
| > if SRE will be around 10 years from now,
|
| Other tasks that SRE typically does now - generalized
| automation, providing dev tools, and improving the dev
| experience - are being moved to "platform" teams and teams
| with names like that. I expect it to change significantly.
| nrr wrote:
| Oddly, the call to put the SWEs in the on-call rotation was
| one of the original goals of site reliability engineering as
| an institutional discipline. The idea at conception was that
| SREs were expensive, and only after product teams got their
| act together could they truly justify the cost of full-time
| reliability engineering support.
|
| It's only in the past 10 years (reasonable people may
| disagree on that figure) that being a site reliability
| engineer came to mean being something other than a
| professional cranky jackass.
|
| What I care about as an SRE is not graphs or performance or
| even whether my pager stays silent (though, that would be
| nice). No, I want the product teams to have good enough tools
| (and, crucially, the knowledge behind them) to keep hitting
| their goals.
|
| Sometimes, frankly, the monitoring and performance get in the
| way of that.
| gen220 wrote:
| > Other tasks that SRE typically does now - generalized
| automation, providing dev tools, and improving the dev
| experience - are being moved to "platform" teams and teams
| with names like that. I expect it to change significantly.
|
| Yeah, this is my experience, too. "DevOps" (loosely, the
| trend you describe in the first paragraph) is eating SRE from
| one end and "Platform" from the other. SRE are basically
| evolving into "System Engineers" responsible for operating
| and sometimes developing common infrastructure and its
| associated tools.
|
| I don't think that's a bad thing at all! Platform engineering
| is more fun, you're distributing the load of responsibility
| in a way that's really sensible, and engineers who are
| directly responsible for tracking regressions, performance,
| and whatnot ime develop better products.
| crabbone wrote:
| > much like [...] QA testers have mostly disappeared.
|
| Who told you that?
|
| QA isn't going anywhere... someone is doing testing, and that
| someone is a tester. They can be an s/w engineer by training,
| but as long as they are testing they are a tester.
|
| With sysadmins, there are fashion waves, where they keep being
| called different names like DevOps or SRE. I've not heard of
| such a thing with testing.
| jeffbee wrote:
| If you think of an SRE as an expensive sysadmin then yes, you
| should absolutely scratch that entire org. SRE, by Google's
| definition, is supposed to contain software engineers with deep
| systems expertise, not some kind of less-qualified SWEs.
| beoberha wrote:
| I work for a large cloud service that is not Google where the
| SRE culture varies heavily depending on which product you're
| building. SREs are a necessity to free up devs to do actual dev
| work. Platform and infra teams should tightly couple SWEs and
| SREs to keep SWEs accountable, but not responsible for day to
| day operations of the infra - you'll never get anything done :)
| icedchai wrote:
| SysAdmins didn't disappear, they just learned some cloud stuff
| and changed titles. We call them "DevOps Engineers" now.
| Smaug123 wrote:
| For much _much_ more on this, I'm most of the way through
| Google's book _Building Secure and Reliable Systems_, which is a
| proper textbook (not light reading). It's a pretty interesting
| book! A lot of what it says is just common sense, but as the
| saying goes, "common sense" is an oxymoron; it's felt useful to
| have refreshed my knowledge of the whole thing at once.
| benlivengood wrote:
| Something I hope to eventually hear is the solution to the full
| cold start problem. Most giant custom-stack companies have
| circular dependencies on core infrastructure. Software-defined
| networking needs some software running to start routing packets
| again, diskless machines need some storage to boot from,
| authentication services need access to storage to start handing
| out service credentials to bootstrap secure authz, etc.
|
| It's currently handled by running many independent regions so
| that data centers can be brought up from fully dark by
| bootstrapping them from existing infra. I haven't heard of anyone
| bringing the stack up from a full power-off situation. Even when
| Facebook completely broke its production network a couple years
| ago the machines stayed on and running and had some internal
| connectivity.
|
| This matters to everyone because while cloud resources are great
| at automatic restarts and fault recovery there's no guarantee
| that AWS, GCP, and friends would come back up after, e.g., a
| massive solar storm that knocks out the grid worldwide for long
| enough to run the generators down.
|
| My guess is that there are some dedicated small DCs with
| exceptional backup power and the ability to be fully isolated
| from grid surges (flywheel transformers or similar).
| Gh0stRAT wrote:
| Azure has procedures in place to prevent circular dependencies,
| and regularly exercises them when bringing new regions online.
|
| IIRC some of the information about their approach is considered
| sensitive so I won't elaborate further.
| throwawaaarrgh wrote:
| The solution to this has to be done earlier, but it's simple:
| start a habit of destroying and recreating everything. If you
| wait to start doing this, it's very painful. If you start doing
| it at the very beginning, you quickly get used to it, and
| breaking changes and weird dependencies are caught early.
|
| You can even do this with hardware. It changes how you
| architect things, to deal with shit getting unplugged or reset.
| You end up requiring more automation, version control and
| change management, which speeds up and simplifies overall work,
| in addition to preventing and quickly fixing outages. It's a
| big culture shift.
| jeffbee wrote:
| When I was in Google SRE we had monitoring and enforcement of
| permitted and forbidden RPC peers, such that a system that
| attempted to use another system would fail or send alerts. This
| was useful at the top of the stack to keep track of
| dependencies silently added by library authors, and at the low
| levels to ensure the things at the bottom of the stack were
| really at the bottom. We also did virtual automated cluster
| turn-up and turn-down, to make sure our documented procedures
| did not get out of date, and in my 6 years in SRE I saw that
| procedure fall from 90 days to under an hour. We also regularly
| exercised the scratch restarts of things like global encryption
| key management, which involves a physical object. The annual
| DiRT exercise also tried to make sure that no person, team, or
| office was necessary to the continuing function of systems.
| jeffrallen wrote:
| The power grid guys claim to have cold start plans locked and
| loaded, but I'm not convinced they would work. Anyone seen an
| after-action report saying how well a real grid cold start
| went? It would also be interesting to know which grid has had
| the most cold starts: in a perfect world, they'd be good at it
| by now. Bet it's in the Caribbean or Africa. But it's funny:
| small grid cold starts (i.e. an isolated island with one diesel
| generator and some solar) are so easy they probably wouldn't
| make good case studies.
|
| It's clear that the Internet itself could not be cold started
| the way AC grids can; there are simply too many AS's. (Think
| about what AS means for a second to understand why a
| coordinated, rehearsed cold start is not possible.)
| alexpotato wrote:
| If you are interested in a similar list but with a bent towards
| being a SRE for 15 years in FinTech/Banks/Hedge Funds/Crypto, let
| me humbly suggest you check out:
|
| https://x.com/alexpotato/status/1432302823383998471?s=20
|
| Teaser: "25. If you have a rules engine where it's easier to make
| a new rule than to find an existing rule based on filter
| criteria: you will end up with lots of duplicate rules."
| xyst wrote:
| Off topic: TIL Google has its own TLD (.google)
| DaiPlusPlus wrote:
| So do .airbus, .barclays, .microsoft, and .travelersinsurance
| too - it's nothing new.
|
| https://data.iana.org/TLD/tlds-alpha-by-domain.txt
| throwawaaarrgh wrote:
| The cheapest way to prevent an outage is to catch it early in its
| lifecycle. Software bugs are like real bugs. First is the egg,
| that's the idea of the change. Then there's the nymph, when it
| hatches; first POC. By the time it hits production, it's an
| adult.
|
| Wait - isn't there a stage before adulthood? Yes! Your app should
| have several stages of maturity before it reaches adulthood. It's
| far cheaper to find that bug before it becomes fully grown (and
| starts laying its own eggs!)
|
| If you can't do canaries and rollbacks are problematic, add more
| testing before the production deploy. Linters, unit tests, end to
| end tests, profilers, synthetic monitors, read-only copies of
| production, performance tests, etc. Use as many ways as you can
| to find the bug early.
|
| Feature flags, backwards compatibility, and other methods are
| also useful. But nothing beats Shift Left.
___________________________________________________________________
(page generated 2023-10-27 23:00 UTC)