[HN Gopher] Lessons Learned from Twenty Years of Site Reliabilit...
       ___________________________________________________________________
        
       Lessons Learned from Twenty Years of Site Reliability Engineering
        
       Author : maheshrjl
       Score  : 291 points
       Date   : 2023-10-27 11:25 UTC (11 hours ago)
        
 (HTM) web link (sre.google)
 (TXT) w3m dump (sre.google)
        
       | 6LLvveMx2koXfwn wrote:
       | "We once narrowly missed a major outage because the engineer who
       | submitted the would-be-triggering change unplugged their desktop
       | computer before the change could propagate". Sorry, what?
        
         | jrms wrote:
         | I thought the same.
        
         | jedberg wrote:
         | The change was being orchestrated from their desktop, and they
          | noticed things were going sideways, so they unplugged their
         | desktop to stop the deployment. Aka "pressed the big red
         | button".
        
         | francisofascii wrote:
         | Yeah, interesting tidbit. It might sound insane today that one
         | engineer's desktop computer could cause such an outage. But
         | that was probably more commonplace 20 years ago and even today
         | in smaller orgs.
        
           | shadowgovt wrote:
           | There was a famous incident at one point where code search
           | internally went down. It turned out that while they had
           | deployed the tool internally, one piece of the indexing
           | process was still running as a cron job on the original
           | developer's desktop machine. He went on vacation, his
            | credentials aged out, and the crawler stopped refreshing the
           | index.
           | 
           | But my favorite incident will forever be the time they had to
           | drill out a safe because they were disaster-testing the
           | password vault system and discovered that the key needed to
            | restore the password vault system was stored in a safe, the
           | combination for which had been moved into the password vault
           | system. Only with advanced, modern technology can you lock
           | the keys to the safe in the safe itself with so many steps!
        
             | tomcam wrote:
             | > But my favorite incident will forever be the time they
             | had to drill out a safe because they were disaster-testing
             | the password vault system
             | 
             | Great story to be sure---but I'm going to call it a
             | success. They did the end-to-end testing and caught it then
              | instead of in real life.
        
               | shadowgovt wrote:
               | Well, mostly-kinda-sorta. ;) It's the internal password
               | vault, and there's only one of them, so it's more like
               | "they broke it on purpose and then had to fix it before
                | the company went off the rails." Among the things kept in
                | that vault are credentials that, if they age out or aren't
                | regularly refreshed, cause key internal infrastructure to
                | start grinding to a halt.
               | 
               | But still, "it broke while engineers were staring at it
               | and trying to break it" is a better scenario than "it
               | broke surprisingly while engineers were trying to do
               | something else."
        
         | dilyevsky wrote:
          | At one point I had to run a script on a substantial portion of
          | their server fleet (like hundreds of thousands of machines) and
          | I remember I ran it with a pssh-style utility from my desktop
          | (was 10y ago so dunno if they still use this). It was
          | surprisingly quick to do it this way. Could've been something
          | like that.
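          | 
          | Roughly the shape of it, from memory - not the actual tool,
          | just a generic fan-out sketch with made-up hostnames and a
          | made-up script name:
          | 
          |     # pssh-style fan-out; the hosts and fix.sh are invented here
          |     import subprocess
          |     from concurrent.futures import ThreadPoolExecutor
          | 
          |     HOSTS = ["host%05d.example.com" % i for i in range(100000)]
          | 
          |     def run(host):
          |         # short timeout so one dead host doesn't stall a batch
          |         cmd = ["ssh", "-o", "ConnectTimeout=5", host, "bash", "-s"]
          |         with open("fix.sh", "rb") as script:
          |             rc = subprocess.run(cmd, stdin=script,
          |                                 capture_output=True).returncode
          |         return host, rc
          | 
          |     with ThreadPoolExecutor(max_workers=500) as pool:
          |         failed = [h for h, rc in pool.map(run, HOSTS) if rc != 0]
          |     print("failed on %d hosts" % len(failed))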
        
         | codemac wrote:
          | It always cracks me up how Google is simultaneously the most
          | web-based company in the world and yet their internal political
          | landscape was (Infra, Search, Ads) > everything else. This
          | leads to infra SWEs writing stupid CLIs all day, rather than
          | having any literal buttons. Things were changing a lot by the
         | time I left though.
         | 
         | I do think Google should be more open about their internal
         | outages. This one in particular was very famous internally.
        
           | dilyevsky wrote:
           | We also avoided some outages by running one-off scripts
           | fleet-wide so it cuts both ways
        
           | throwawaaarrgh wrote:
           | Every large enterprise is internally a tire fire.
        
         | jeffbee wrote:
         | It's a logical consequence of the "zero trust" network. If an
         | engineer's workstation can make RPCs to production systems, and
         | that engineer is properly entitled to assume some privileged
         | role, then there's no difference between running the automation
         | in prod and running it on your workstation. Even at huge
         | scales, shell tools plus RPC client CLIs can contact every
         | machine in the world pretty promptly.
        
       | jedberg wrote:
       | This is a great writeup, and very broadly applicable. I don't see
       | any "this would only apply at Google" in here.
       | 
       | > COMMUNICATION CHANNELS! AND BACKUP CHANNELS!! AND BACKUPS FOR
       | THOSE BACKUP CHANNELS!!!
       | 
       | Yes! At Netflix, when we picked vendors for systems that we used
       | during an outage, we always had to make sure they were _not_ on
       | AWS. At reddit we had a server in the office with a backup IRC
       | server in case the main one we used was unavailable.
       | 
       | And I think Google has a backup IRC server on AWS, but that might
       | just apocryphal.
       | 
       | Always good to make sure you have a backup side channel that has
       | as little to do with your infrastructure as possible.
        
         | dilyevsky wrote:
         | Afair Google just ran irc on their corp network which was
         | completely separate from prod so I wouldn't be surprised if it
         | was in a small server room in the office somewhere.
         | 
         | > very broadly applicable. I don't see any "this would only
         | apply at Google" in here.
         | 
          | One thing I haven't even heard of anyone else doing is
          | production panic rooms - a secure room with a backup VPN to
          | prod.
        
           | DaiPlusPlus wrote:
           | > Afair Google just ran irc on their corp network which was
           | completely separate from prod
           | 
           | I thought Google didn't have a "corp" network because of
           | their embrace of zero-trust in BeyondCorp?
        
             | dilyevsky wrote:
             | I don't think zero-trust prohibits network segmentation for
             | redundancy or due to geographical constraints etc. It's
             | mainly about how you gain access.
        
               | Moto7451 wrote:
               | Correct. At an old job we did zero trust corp on a
               | different AWS region and account. The admin site was a
               | different zero trust zone in prod region/account and was
               | supposed to eventually become another AWS account in
               | another region (for cost purposes).
               | 
               | I can't say if any of this was ideal but it did work
               | unobtrusively.
        
             | shanemhansen wrote:
             | They do. But I'd say most employees go their whole career
             | without needing to do anything that requires a VPN.
             | 
              | It's basically all web-based access through what is, at the
              | end of the day, an HTTP proxy.
             | 
             | SREs need to be ready for stuff like "hey, what if the big
             | proxy we all use to access internal resources is down?".
        
           | jeffbee wrote:
           | When I was there, the main IRC ran in prod. But it was
           | intentionally a low-dependency system, an actual IRC server
           | instead of something ridiculous like gIRC or IRC-over-stubby.
        
             | dilyevsky wrote:
             | i think it had a corp dns label but i'm fuzzy on that. yes
             | it could've been a prod instance which would mean you'd
             | need to go to panic room in case corp was down but maybe
             | that was the intention.
        
         | gaogao wrote:
         | Also make sure your backup channel can scale. Getting a flood
         | of 10,000+ folks over to a dinky IRC server can knock it over
          | easily. Throttling new joiners isn't a panacea either, since
          | there might be someone critical who needs to get onto a
          | channel, which throttling can complicate.
        
           | londons_explore wrote:
           | Maybe I'm naive, but I would imagine that any raspberry pi
           | could run an IRC server with 10,000 users...
           | 
            | Surely 10,000 users each receiving on average perhaps five
            | 50-byte messages per second (that's a very busy chat room!) is
            | a total bandwidth of 2.5 megabytes per second.
           | 
           | And the CPU time to shuffle messages from one TCP connection
           | to another, or encrypt/decrypt 2.5 megabytes per second
           | should be small. There is no complex parsing involved - it is
           | literally just "forward this 50 byte message into this list
           | of TCP connections".
           | 
           | If they're all internal/authed users, you can hopefully
           | assume nobody is deliberately DoSing the server with millions
           | of super long messages into busy channels too.
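            | 
            | Back of the envelope (just the numbers from above):
            | 
            |     users, msgs_per_sec, msg_bytes = 10_000, 5, 50
            |     print(users * msgs_per_sec * msg_bytes / 1e6, "MB/s")  # 2.5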
        
             | xena wrote:
             | IRC servers are single threaded. You have contention at
             | that point.
        
         | strangattractor wrote:
         | Brings back memories.
         | 
         | I started getting phone calls at 5AM regarding servers going
         | down. I immediately went to email to connect with other members
         | of the team and found my inbox full of thousands of emails from
         | the monitoring service notifying me of the outage. Those were
          | the days :)
        
         | fragmede wrote:
         | The ultimate in "this would only apply at Google", was having
         | to establish backup communication side channels that weren't
         | Google dependent. While on call as an SRE for Google on the
          | Internet traffic team, we would naturally default to Google
          | Meet for communications during an incident, but the question
          | is, what would we do if Google Meet is down? It's a
          | critical-path team, which means that if I got paged because my
          | team's system is down, it's not improbable that Google Meet
          | (along with Google.com) was down because of it.
         | 
          | We needed to have extra layers of backup communications because
          | the first three layers of systems were all also Google
          | properties.
         | That is to say, email wouldn't work because Gmail is Google,
         | our work cell phones wouldn't work because they're Google Fi,
         | and my home's ISP was Google Fiber/Webpass.
         | 
         | All of which is to confirm that, yes, Google has a backup IRC
         | server for communication. I won't say where, but it's
         | explicitly totally off Google infrastructure for that very
         | reason.
        
           | jeffrallen wrote:
           | Hurricane Electric? He, he, he.
        
         | hinkley wrote:
         | Before divestiture we had two groups splitting responsibility
         | for IT, so our DevOps team (bleh) only had partial control of
          | things. We were then running in heterogeneous mode - one
          | on-prem data center plus cloud services.
          | 
          | One day a production SAN went wonky, taking out our on-prem
          | data center... and also Atlassian with it. No Jira, no
         | Confluence. Possibly no CI/CD as well. No carefully curated
         | runbooks to recover. Just tribal knowledge.
         | 
         | People. Were. Furious. And rightfully so. The 'IT' team lost
         | control of a bunch of things in that little incident. Who puts
         | customer facing and infrastructure eggs into the same basket
         | like that?
        
         | l9i wrote:
         | > And I think Google has a backup IRC server on AWS, but that
          | might just be apocryphal.
         | 
         | They (>1) do exist but not on AWS or any other major cloud
         | provider.
         | 
         | (Or at least that was the case two years ago when I still
         | worked there.)
        
       | tiddo_langerak wrote:
       | I'm curious how people approach big red buttons and intentional
       | graceful degradation in practice, and especially how to ensure
       | that these work when the system is experiencing problems.
       | 
       | E.g. do you use db-based "feature flags"? What do you do then if
       | the DB itself is overloaded, or the API through which you access
       | the DB? Or do you use static startup flags (e.g. env variables)?
       | How do you ensure these get rolled out quickly enough? Something
       | else entirely?
        
         | shadowgovt wrote:
          | When you're a small company, simpler is actually better... It's
          | better to keep things simple so that recovery is easy than to
          | build out a more complicated solution that is reliable in the
          | average case but fragile at the limits. Even if that means
          | there are some places on the critical path where you don't use
          | double redundancy, as a result the system is simple enough to
          | fit in the heads of all the maintainers and can be rebooted or
          | reverted easily.
         | 
         | ... But once your firm starts making guarantees like "five
         | nines uptime," there will be some complexity necessary to
         | devise a system that can continue to be developed and improved
         | while maintaining those guarantees.
        
         | dilyevsky wrote:
         | There's a chapter on client-side throttling in the sre book -
         | https://sre.google/sre-book/handling-overload/
         | 
         | At google we also had to routinely do "backend drains" of
         | particular clusters when we deemed them unhealthy and they had
         | a system to do that quickly at the api/lb layer. At other
          | places I've also seen that done with application-level flags,
          | so you'd do kubectl edit, which is obviously less than ideal
          | but worked.
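          | 
          | For reference, the client-side throttling in that chapter
          | boils down to tracking requests and accepts over a rolling
          | window and rejecting locally with probability
          | max(0, (requests - K * accepts) / (requests + 1)). My rough
          | paraphrase, not the production code (window decay omitted):
          | 
          |     import random
          | 
          |     K = 2  # the book suggests ~2 as a starting point
          | 
          |     class AdaptiveThrottle:
          |         def __init__(self):
          |             self.requests = 0  # attempts made by this client
          |             self.accepts = 0   # attempts the backend accepted
          | 
          |         def allow(self):
          |             p_reject = max(0.0, (self.requests - K * self.accepts)
          |                                 / (self.requests + 1))
          |             return random.random() >= p_reject
          | 
          |         def record(self, accepted):
          |             self.requests += 1
          |             if accepted:
          |                 self.accepts += 1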
        
         | justapassenger wrote:
          | Implementation details will depend on your stack, but 3 main
          | things I'd keep in mind:
          | 
          | 1. Keep it simple. No elaborate logic. No complex data stores.
          | Just a simple check of the flag (rough sketch at the end of
          | this comment).
          | 
          | 2. Do it as close to the source as possible, but have limited
          | trust in your clients - you may have old versions, things not
          | propagating, bugs, etc. So best to have the option to degrade
          | both in the client and on the server. If you can do only one,
          | do the server side.
         | 
          | 3. Real-world test! And test often. Don't trust the test
          | environment. Test on real-world traffic. Do periodic tests at
         | small scale (like 0.1% of traffic) but also do more full scale
         | tests on a schedule. If you didn't test it, it won't work when
         | you need it. If it worked a year ago, it will likely not work
         | now. If it's not tested, it'll likely cause more damage than
         | it'll solve.
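          | 
          | For (1), the whole mechanism can be as dumb as this. A sketch
          | only - the file path and env var are invented, and whether you
          | fail open or closed is a per-feature decision:
          | 
          |     import os
          | 
          |     def degrade_mode():
          |         # big red button: a file dropped by ops wins, then an
          |         # env var baked in at startup; no DB, no RPC involved
          |         if os.path.exists("/etc/myapp/DEGRADE"):  # hypothetical
          |             return True
          |         return os.environ.get("MYAPP_DEGRADE", "0") == "1"
          | 
          |     if degrade_mode():
          |         print("serving the degraded path")  # skip the heavy call
          |     else:
          |         print("serving the full feature")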
        
           | tomcam wrote:
           | 4. Document procedures and methodically test the docs as well
        
         | dineshkumar_cs wrote:
          | Versioning, feature toggles, and rollback - automated and
          | implemented at different levels depending on the system. It
          | could be an env configuration, a DB field, down-migration
          | scripts, or deploying the last working version.
        
         | anonacct37 wrote:
         | I worked at enough companies to see that many of them have some
         | notion of "centralized config that's rolled to the edge and can
         | be updated at runtime".
         | 
         | I've done this with djb's CDB (constant database). But I've
         | seen people poll an API for JSON config files or
         | dbm/gdbm/Berkeleydb/leveldb.
         | 
         | This can extend to other big red buttons. It's not that elegant
         | but I've had numerous services that checked for the presence of
         | a file to serve health checks. So pulling a node out of load
         | balancer rotation was as easy as creating a file.
         | 
          | The idea is that when there's a database outage, the system
          | defaults to serving the last known good config.
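          | 
          | The health-check trick is about this much code. A sketch only
          | (stdlib http.server just for illustration; the drain path is
          | invented):
          | 
          |     import os
          |     from http.server import BaseHTTPRequestHandler, HTTPServer
          | 
          |     DRAIN_FILE = "/var/run/myapp/drain"  # hypothetical path
          | 
          |     class Health(BaseHTTPRequestHandler):
          |         def do_GET(self):
          |             # the LB polls this; the file's presence flips the
          |             # node to unhealthy and traffic drains away
          |             self.send_response(503 if os.path.exists(DRAIN_FILE)
          |                                else 200)
          |             self.end_headers()
          | 
          |     HTTPServer(("", 8080), Health).serve_forever()
          | 
          | Pulling a node is then just `touch`, putting it back is `rm`,
          | and both still work when whatever fancier control plane you
          | have is on fire.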
        
         | masto wrote:
         | It depends.
         | 
         | To make up an example that doesn't depend on any of those
         | things: imagine that I've added a new feature to Hacker News
         | that allows users to display profile pictures next to their
         | comments. Of course, we have built everything around
         | microservices, so this is implemented by the frontend page
         | generator making a call to the profile service, which does a
         | lookup and responds with the image location. As part of the
         | launch plan, I document the "big red button" procedure to
         | follow if my new component is overloading the profile service
         | or image repository: run this command to rate-limit my
         | service's outgoing requests at the network layer (probably to 0
         | in an emergency). It will fail its lookups and the page
         | generator is designed to gracefully degrade by continuing to
         | render the comment text, sans profile photo.
         | 
         | (Before anyone hits send on that "what a stupid way to do X"
         | reply, please note that this is not an actual design doc, I'm
         | not giving advice on how to build anything, it's just a crayon
         | drawing to illustrate a point)
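          | 
          | To stretch the crayon drawing one step further, the page
          | generator side is just a lookup with a short timeout and a
          | None fallback (every name and URL below is invented):
          | 
          |     import requests
          | 
          |     PROFILE_SVC = "http://profile-service.internal/avatar"
          | 
          |     def avatar_url(user_id):
          |         # any failure - timeout, 429 from the rate limit, or the
          |         # big red button set to 0 qps - just means "no photo"
          |         try:
          |             r = requests.get(PROFILE_SVC, params={"user": user_id},
          |                              timeout=0.2)
          |             r.raise_for_status()
          |             return r.json()["url"]
          |         except (requests.RequestException, KeyError, ValueError):
          |             return None  # render the comment without the photo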
        
       | Dowwie wrote:
       | If you're using AWS resources, give LocalStack a try for
       | integration testing
        
         | whummer wrote:
          | 100% - including some of the Chaos Engineering features that
          | have recently been added to the platform (e.g., simulating
          | service errors/latencies, region outages, etc.)
        
         | galkk wrote:
          | My wife used LocalStack at a previous job and it was a
          | miserable experience. Especially their emulation of queues.
          | 
          | Maybe things have improved since a couple of years ago, though.
        
       | fensterblick wrote:
       | Recently, I've heard of several companies folding up SRE and
       | moving individuals to their SWE teams. Rumors are that LinkedIn,
       | Adobe, and Robinhood have done this.
       | 
       | This made me think: is SRE a byproduct of a bubble economy of
       | easy money? Why not operate without the significant added expense
       | of SRE teams?
       | 
       | I wonder if SRE will be around 10 years from now, much like Sys
       | Admins and QA testers have mostly disappeared. Instead, many of
       | those functions are performed by software development teams.
        
         | kccqzy wrote:
         | SRE is not a byproduct of a bubble economy. I believe Google
          | has had SREs since the very beginning. But I think the rest of
          | the point still stands. These days, with devops, the skill set
          | needed for devs has indeed expanded to have significant overlap
          | with SREs. I expect companies to downsize
         | their SRE teams and distribute responsibilities to devs.
         | 
         | A second major reason is automation. If you read the linked
         | site long enough you'll find that in the early days of Google,
         | SREs did plenty of manual work like deployments while manually
         | watching graphs. They were indispensable then simply because
         | even Google didn't have enough automation in their systems! You
         | can read the story of Sisyphus
         | https://www.usenix.org/sites/default/files/conference/protec...
          | to get a sense of how Google's initial failure to adopt
          | standardized automation ensured job security for SREs.
        
           | davedx wrote:
           | > "These days with devops the skill set needed for devs have
           | indeed expanded to have significant overlap with SREs"
           | 
           | Respectfully disagree on this. SRE is a huge complex realm
           | unto itself. Just understanding how all the cloud components
           | and environments and role systems work together is multiple
           | training courses, let alone how to reliably deploy and run in
           | them.
        
             | derefr wrote:
             | But modern approaches to dev require the SWEs to understand
             | and model the operation of their software, and in fact
             | program in terms of it -- "writing infrastructure" rather
             | than just code.
             | 
             | Lambda functions, for example: you have to understand their
             | performance and scalability characteristics -- in turn
             | requiring knowledge of things like the latency added by
             | crossing the boundary between a managed shared service
             | cluster and a VPC -- in order to understand how and where
             | to factor things into individual deployable functions.
        
           | dekhn wrote:
            | Pedantically, Google didn't have SREs at the beginning. I
           | asked a very early SRE, Lucas,
           | (https://www.nytimes.com/2002/11/28/technology/postcards-
           | from... and https://hackernoon.com/this-is-going-to-be-huge-
           | google-found...), and he said that in the early days, outages
           | would be really distracting to "the devs like Jeff and
           | Sanjay" and he and a few others ended up forming SRE to
           | handle site reliability more formally during the early days
           | of growth, when Google got a reputation for being fast and
           | scalable and nearly always up.
           | 
           | Lucas helped make one of my favorite Google Historical
           | Artefacts, a crayon chart of search volume. They had to
           | continuously rescale the graph in powers of ten due to
           | exponential growth.
           | 
           | I miss pre-IPO Google and the Internet of that time.
        
           | jeffrallen wrote:
           | The idea of ops people who wrote code for deployment and
           | monitoring and had responsibility for incident management and
           | change control existed before Google gave it a name.
           | 
           | Source: I was one at WebTV in 1996, and I worked with people
           | who did it at Xerox PARC and General Magic long before then.
        
         | vitalysh wrote:
         | I guess now they have a team of software engineers, where part
         | is focused on infra and part on backend. Sys Admins
         | disappeared? They are DevOps/IT Engineers now. QA? SWE in Test,
         | and so on.
        
         | vsareto wrote:
         | >Instead, many of those functions are performed by software
         | development teams.
         | 
         | And they likely won't be as good as dedicated SRE teams.
         | 
         | But few businesses care about that right now considering
         | layoffs.
         | 
          | Throwing developers at a problem even if it isn't in their
          | skill set is an industry trend that won't go away and will be
          | more pronounced during downturns.
         | 
         | Full stack developers are a great example of rolling two roles
         | together without twice the pay.
        
           | davedx wrote:
           | Eh it's not the same thing. (I'm very full stack with
           | intermittent devops/sre experience).
           | 
           | Full stack means you write code running on back end and front
           | end. 99% of the time the code you write on the FE interfaces
           | with your other code for the BE. It's pretty coherent and
           | feedback loops are similar.
           | 
            | Devops/SRE on the other hand is very different, and I agree
            | we shouldn't expect software developers to be mixing in SRE
            | in their day-to-day. The skills, tools, mindset, feedback
            | loop, and stress levels are too different.
           | 
           | If you're not doing simple monoliths then you need a
           | dedicated devops/SRE team.
        
             | vsareto wrote:
             | If you can be good at front and back end and keep up with
             | both of them simultaneously, that's great, but:
             | 
             | - you spend more time to keep up with both of those sectors
             | compared to dedicated front or back end positions
             | 
             | - you context switch more often than dedicated positions
             | 
              | - you spend more time getting good at both of those things
              | 
              | - you remove some amount of communication overhead that
              | would exist if these were two positions
             | 
             | You are definitely not being compensated for that extra
             | work and benefit to the business given that full stack
             | salaries are close to front end and back end position
             | salaries.
        
         | sarchertech wrote:
         | I haven't noticed that in my corner of one of those mentioned
         | companies. Also I'm not an SRE, but during the height of the
          | recent tech layoffs the only job postings I was seeing were
          | for SRE.
        
         | gen220 wrote:
         | I imagine the threshold is something like 1 SRE for every $1mm
         | of high-margin revenue you can link to guaranteeing the 2nd "9"
         | of $product availability/reliability.
        
           | donalhunt wrote:
           | I believe that is indeed a good guide for when it makes sense
           | to have a SRE team supporting a service or product (with the
           | caveat that the number probably isn't $1MM).
           | 
           | There are also good patterns for ensuring you actually have
            | adequate SRE coverage for what the business needs. 2 x 6-person
            | teams, geographically dispersed, doing 7x12 shifts works
            | pretty well (not cheap). You can do it with less but you run
           | into more challenges when individuals leave / get burnt out /
           | etc.
        
         | VirusNewbie wrote:
          | > is SRE a byproduct of a bubble economy of easy money? Why not
         | operate without the significant added expense of SRE teams?
         | 
         | I'm a SWE SRE. I think in some cases it is better to be folded
         | into a team. In other cases, less so.
         | 
         | One SRE team can support _many_ different dev teams, and often
         | the dev teams are not at all focusing time on the very
          | complicated infra/distributed systems aspect of their job;
          | it's just not something that they worry about day to day.
         | 
          | So it makes sense to have an 'infra' team that operates at a
          | different granularity than specialized dev teams.
         | 
         | That may or may not need to be called SRE, or maybe it's an SRE
         | SWE team, or maybe you just call it 'infrastructure' but at a
         | certain scale you have more cross cutting concerns across teams
         | where it's cheaper to split things out that way.
        
         | xorcist wrote:
          | Sys admins changed names to SREs, which changed names to devops
         | engineers or cloud engineers or whatever the title is now.
         | 
         | Still the same competency. Someone needs to know how those
         | protocols work, tell you latency characteristics of storage,
         | and read those core dumps.
        
           | VirusNewbie wrote:
           | In my G SRE interview, I had to do the same rigorous Software
           | Engineering algorithms rounds as well as show deep
           | distributed systems knowledge in designing highly available
           | systems.
        
         | tayo42 wrote:
          | Two parts to this:
          | 
          | Is SRE a bubble thing?
          | 
          | I never got why SRE existed. (SRE has been my title...) The job
          | responsibilities - caring about monitoring, logging,
          | performance, and metrics of applications - are all things a
          | qualified developer should be doing. Offloading caring about
          | operating the software someone writes to someone else just
          | seems illogical to me. Put the SWEs on call. If SWEs think the
          | best way to do something is manual, have them do it themselves,
          | then fire them for being terrible engineers. All these tedious
          | interviews and a SWE doesn't know how the computer they are
          | programming works? It's insane. All that schooling and things
          | like how the OS works, which is part of an undergrad
          | curriculum, gets offloaded to a career and title mostly made up
          | of self-taught sysadmin people? Every good SWE I've known knew
          | how the OS, computer, and network work.
         | 
         | > if SRE will be around 10 years from now,
         | 
          | Other tasks that SRE typically does now - generalized
          | automation, providing dev tools, and improving the dev
          | experience - are being moved to "platform" and teams with
          | those names. I expect it to change significantly.
        
           | nrr wrote:
           | Oddly, the call to put the SWEs in the on-call rotation was
           | one of the original goals of site reliability engineering as
           | an institutional discipline. The idea at conception was that
           | SREs were expensive, and only after product teams got their
           | act together could they truly justify the cost of full-time
           | reliability engineering support.
           | 
           | It's only in the past 10 years (reasonable people may
           | disagree on that figure) that being a site reliability
           | engineer came to mean being something other than a
           | professional cranky jackass.
           | 
           | What I care about as an SRE is not graphs or performance or
           | even whether my pager stays silent (though, that would be
           | nice). No, I want the product teams to have good enough tools
           | (and, crucially, the knowledge behind them) to keep hitting
           | their goals.
           | 
           | Sometimes, frankly, the monitoring and performance get in the
           | way of that.
        
           | gen220 wrote:
            | > Other tasks that SRE typically does now - generalized
            | automation, providing dev tools, and improving the dev
            | experience - are being moved to "platform" and teams with
            | those names. I expect it to change significantly.
           | 
           | Yeah, this is my experience, too. "DevOps" (loosely, the
           | trend you describe in the first paragraph) is eating SRE from
           | one end and "Platform" from the other. SRE are basically
           | evolving into "System Engineers" responsible for operating
           | and sometimes developing common infrastructure and its
           | associated tools.
           | 
           | I don't think that's a bad thing at all! Platform engineering
           | is more fun, you're distributing the load of responsibility
           | in a way that's really sensible, and engineers who are
           | directly responsible for tracking regressions, performance,
            | and whatnot, IME, develop better products.
        
         | crabbone wrote:
         | > much like [...] QA testers have mostly disappeared.
         | 
         | Who told you that?
         | 
         | QA isn't going anywhere... someone is doing testing, and that
         | someone is a tester. They can be an s/w engineer by training,
         | but as long as they are testing they are a tester.
         | 
         | With sysadmins, there are fashion waves, where they keep being
         | called different names like DevOps or SRE. I've not heard of
         | such a thing with testing.
        
         | jeffbee wrote:
         | If you think of an SRE as an expensive sysadmin then yes, you
         | should absolutely scratch that entire org. SRE, by Google's
         | definition, is supposed to contain software engineers with deep
         | systems expertise, not some kind of less-qualified SWEs.
        
         | beoberha wrote:
         | I work for a large cloud service that is not Google where the
         | SRE culture varies heavily depending on which product you're
         | building. SREs are a necessity to free up devs to do actual dev
         | work. Platform and infra teams should tightly couple SWEs and
         | SREs to keep SWEs accountable, but not responsible for day to
         | day operations of the infra - you'll never get anything done :)
        
         | icedchai wrote:
         | SysAdmins didn't disappear, they just learned some cloud stuff
         | and changed titles. We call them "DevOps Engineers" now.
        
       | Smaug123 wrote:
        | For much _much_ more on this, I'm most of the way through
       | Google's book _Building Secure and Reliable Systems_, which is a
       | proper textbook (not light reading). It's a pretty interesting
       | book! A lot of what it says is just common sense, but as the
       | saying goes, "common sense" is an oxymoron; it's felt useful to
       | have refreshed my knowledge of the whole thing at once.
        
       | benlivengood wrote:
       | Something I hope to eventually hear is the solution to the full
       | cold start problem. Most giant custom-stack companies have
       | circular dependencies on core infrastructure. Software-defined
       | networking needs some software running to start routing packets
       | again, diskless machines need some storage to boot from,
       | authentication services need access to storage to start handing
       | out service credentials to bootstrap secure authz, etc.
       | 
       | It's currently handled by running many independent regions so
       | that data centers can be brought up from fully dark by
       | bootstrapping them from existing infra. I haven't heard of anyone
       | bringing the stack up from a full power-off situation. Even when
       | Facebook completely broke its production network a couple years
       | ago the machines stayed on and running and had some internal
       | connectivity.
       | 
       | This matters to everyone because while cloud resources are great
       | at automatic restarts and fault recovery there's no guarantee
       | that AWS, GCP, and friends would come back up after, e.g., a
       | massive solar storm that knocks out the grid worldwide for long
       | enough to run the generators down.
       | 
       | My guess is that there are some dedicated small DCs with
       | exceptional backup power and the ability to be fully isolated
       | from grid surges (flywheel transformers or similar).
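        | 
        | The planning half of that is at least easy to state: if you
        | have the dependency graph, a topological sort gives you a
        | cold-start order, and any cycle is exactly where you need an
        | out-of-band bootstrap seed. Toy illustration (the service names
        | are invented):
        | 
        |     from graphlib import TopologicalSorter, CycleError
        | 
        |     # service -> things it needs before it can come up
        |     deps = {
        |         "sdn":          {"config_store"},
        |         "storage":      {"sdn"},
        |         "auth":         {"storage", "sdn"},
        |         "config_store": {"auth"},  # <- the circular dependency
        |     }
        | 
        |     try:
        |         print(list(TopologicalSorter(deps).static_order()))
        |     except CycleError as e:
        |         print("needs a bootstrap seed; cycle:", e.args[1])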
        
         | Gh0stRAT wrote:
         | Azure has procedures in place to prevent circular dependencies,
         | and regularly exercises them when bringing new regions online.
         | 
         | IIRC some of the information about their approach is considered
         | sensitive so I won't elaborate further.
        
         | throwawaaarrgh wrote:
         | The solution to this has to be done earlier, but it's simple:
         | start a habit of destroying and recreating everything. If you
         | wait to start doing this, it's very painful. If you start doing
         | it at the very beginning, you quickly get used to it, and
         | breaking changes and weird dependencies are caught early.
         | 
         | You can even do this with hardware. It changes how you
         | architect things, to deal with shit getting unplugged or reset.
         | You end up requiring more automation, version control and
         | change management, which speeds up and simplifies overall work,
         | in addition to preventing and quickly fixing outages. It's a
         | big culture shift.
        
         | jeffbee wrote:
         | When I was in Google SRE we had monitoring and enforcement of
         | permitted and forbidden RPC peers, such that a system that
         | attempted to use another system would fail or send alerts. This
         | was useful at the top of the stack to keep track of
         | dependencies silently added by library authors, and at the low
         | levels to ensure the things at the bottom of the stack were
         | really at the bottom. We also did virtual automated cluster
         | turn-up and turn-down, to make sure our documented procedures
         | did not get out of date, and in my 6 years in SRE I saw that
         | procedure fall from 90 days to under an hour. We also regularly
         | exercised the scratch restarts of things like global encryption
         | key management, which involves a physical object. The annual
         | DiRT exercise also tried to make sure that no person, team, or
         | office was necessary to the continuing function of systems.
        
         | jeffrallen wrote:
         | The power grid guys claim to have cold start plans locked and
         | loaded, but I'm not convinced they would work. Anyone seen an
         | after-action report saying how well a real grid cold start
         | went? It would also be interesting to know which grid has had
         | the most cold starts: in a perfect world, they'd be good at it
         | by now. Bet it's in the Caribbean or Africa. But it's funny:
         | small grid cold starts (i.e. an isolated island with one diesel
         | generator and some solar) are so easy they probably wouldn't
         | make good case studies.
         | 
          | It's clear that the Internet itself could not be cold started
          | like the AC grids; there are simply too many ASes. (Think about
         | what AS means for a second to understand why a coordinated,
         | rehearsed cold start is not possible.)
        
       | alexpotato wrote:
        | If you are interested in a similar list but with a bent towards
        | being an SRE for 15 years in FinTech/Banks/Hedge Funds/Crypto, let
       | me humbly suggest you check out:
       | 
       | https://x.com/alexpotato/status/1432302823383998471?s=20
       | 
       | Teaser: "25. If you have a rules engine where it's easier to make
       | a new rule than to find an existing rule based on filter
       | criteria: you will end up with lots of duplicate rules."
        
       | xyst wrote:
       | Off topic: TIL Google has its own TLD (.google)
        
         | DaiPlusPlus wrote:
          | So do .airbus, .barclays, .microsoft, and .travelersinsurance
          | too - it's nothing new.
         | 
         | https://data.iana.org/TLD/tlds-alpha-by-domain.txt
        
       | throwawaaarrgh wrote:
       | The cheapest way to prevent an outage is to catch it early in its
       | lifecycle. Software bugs are like real bugs. First is the egg,
       | that's the idea of the change. Then there's the nymph, when it
       | hatches; first POC. By the time it hits production, it's an
       | adult.
       | 
       | Wait - isn't there a stage before adulthood? Yes! Your app should
       | have several stages of maturity before it reaches adulthood. It's
       | far cheaper to find that bug before it becomes fully grown (and
       | starts laying its own eggs!)
       | 
       | If you can't do canaries and rollbacks are problematic, add more
       | testing before the production deploy. Linters, unit tests, end to
       | end tests, profilers, synthetic monitors, read-only copies of
       | production, performance tests, etc. Use as many ways as you can
       | to find the bug early.
       | 
       | Feature flags, backwards compatibility, and other methods are
       | also useful. But nothing beats Shift Left.
        
       ___________________________________________________________________
       (page generated 2023-10-27 23:00 UTC)