[HN Gopher] The Evolution of SRE at Google
       ___________________________________________________________________
        
       The Evolution of SRE at Google
        
       Author : r4um
       Score  : 101 points
       Date   : 2025-01-03 11:38 UTC (11 hours ago)
        
 (HTM) web link (www.usenix.org)
 (TXT) w3m dump (www.usenix.org)
        
       | 0xbadcafebee wrote:
       | They're doing that thing that happened to DevOps. It started out
       | as a guy who wanted a way for devs and sysadmins to talk about
       | deploys together, so they didn't get dead-cat syndrome. It ended
       | up as an entire branch of business management theory,
       | consultants, and a whole lot of ignorant people who think it just
       | means "a dev who does sysadmin tasks".
       | 
       | Abuse of a single word to mean too many things makes it
       | meaningless. SRE now has that distinction. You've got SREs who
       | (mostly) write software, SREs who (mostly) stare at graphs and
       | deal with random system failures, and now SREs who use a
       | framework to develop a complex risk model of multiple systems
       | (which is more quality control than engineering).
        
         | inquist wrote:
         | I think failure mode analysis is definitely part of engineering
        
         | ImPostingOnHN wrote:
         | Without resorting to any "big-D Devops" definition, I have
         | almost always seen devops referring to "supporting / running
         | the code you write", and have never encountered the definition
         | where dev and ops were 2 different roles. That was what things
         | were like before devops, and coordination on product support
         | and planning wasn't great, hence devops.
        
           | moandcompany wrote:
           | A lot of organizations simply renamed the functional area of
           | "systems administration" or "systems engineering" to
           | "DevOps," and at many of these places, "DevOps" is the new
           | name for the group that software developers will throw stuff
           | over the fence to.
           | 
           | The issue with the above names is that they can be applied to
           | a domain or area of practices, or an organizational boundary.
           | In a non-trivial number of organizations, "DevOps" is viewed
           | as a support entity for one or more software development
           | teams, versus software development teams practicing "devops."
           | 
           | This applies to many of the *_Ops names in fashion during the
           | past five years or so.
        
             | znpy wrote:
             | After almost 10 years in the systems engineering /
             | administration / devops / cloud etc space all I can say is:
             | 
             | The biggest improvement that devops brought is that it made
             | managers feel dumb, outdated and scared because they were
             | not "doing devops" while everybody else was, so they kinda
             | started listening to sysadmins and what they had to say.
             | 
             | Uh, devops engineers did not come out of nowhere. They did
             | not come out of the ground like mushrooms. Most if not all
             | the "devops engineers" i know are just former sysadmins.
             | They were already willing to do whatever devops was
             | supposed to be, it's just that they were largely ignored.
             | 
             | Writing this I just realized that maybe the best way to
             | obtain organizational change is to make management and
             | upper management feel stupid and outdated. Interesting.
        
               | steveBK123 wrote:
               | Hence they have all hired a "head of AI" in the last 18
               | months.
        
           | stackskipton wrote:
           | I'm an Ops type of person, so I work at companies where there
           | is a split between the two. Ops is a skill not all developers
           | have, and frankly not all have the mindset to do it properly,
           | so you will need a team/person to do it. Generally companies
           | don't like the cost of embedding an Ops person into every
           | team, and that can create redundant work, so they form a
           | DevOps/SRE team.
           | 
           | Good resource for different types of teams is here:
           | https://web.devopstopologies.com/
        
           | 0xbadcafebee wrote:
           | The reason you never encountered the second definition is
           | two-fold:
           | 
           | 1) There is no formal academic education behind the concept
           | (that I'm aware of). If you do a CS major, nobody's going to
           | explain to you the accumulated 15 years of practice and
           | knowledge around the concept.
           | 
           | 2) Due to 1), people just repeat what other people tell them.
           | It's like a long game of telephone. It turns out most
           | software development today is just a game of telephone
           | between devs (and now AI). So almost everyone is misinformed.
           | 
           | The Wikipedia page for DevOps is the best generic starting
           | point if you want to know more.
           | 
           | If you want to know more after that, there are a number of
           | books and blog posts. Jez Humble, John Willis, Gene Kim,
           | Patrick Debois, etc are the people to read. It's a much
           | larger body of knowledge than you might think. Almost none of
           | it has to do with devs supporting/running what they write
           | (that's a small subset of a larger category, and there's
           | multiple categories of 'stuff')
        
         | dilyevsky wrote:
         | > You've got SREs who (mostly) write software, SREs who
         | (mostly) stare at graphs and deal with random system failures,
         | and now SREs who use a framework to develop a complex risk
         | model of multiple systems (which is more quality control than
         | engineering).
         | 
         | This was always the case or at least going back 15 years or
         | more highlighted by the so called "treynor curve"
        
         | 01HNNWZ0MV43FF wrote:
         | Never heard of dead-cat syndrome. In case anyone else wonders:
         | 
         | > There is one thing that is absolutely certain about throwing
         | a dead cat on the dining room table - and I don't mean that
         | people will be outraged, alarmed, disgusted. That is true, but
         | irrelevant. The key point, says my Australian friend, is that
         | everyone will shout, "Jeez, mate, there's a dead cat on the
         | table!" In other words, they will be talking about the dead cat
         | - the thing you want them to talk about - and they will not be
         | talking about the issue that has been causing you so much grief
         | 
         | https://en.wikipedia.org/wiki/Dead_cat_strategy
        
           | johnkpaul wrote:
           | I actually don't think that's the dead-cat-saying that the
           | parent is referencing. I think that it's this concept
           | http://itskeptic.org/dead-cat-syndrome.html
           | 
           | I am also unfamiliar though and I'm reading up on it right
           | now.
        
             | 0xbadcafebee wrote:
             | That's the one. Only old fogies (like me) know it I guess.
             | It was the thing we all referred to as the impetus behind
             | DevOps, when it became a thing a decade ago.
        
       | qwertox wrote:
       | SRE == Site Reliability Engineering.
       | 
       | Quoting Wikipedia:
       | 
       | Site Reliability Engineering (SRE) is a discipline in the field
       | of Software Engineering that monitors and improves the
       | availability and performance of deployed software systems, often
       | large software services that are expected to deliver reliable
       | response times across events such as new software deployments,
       | hardware failures, and cybersecurity attacks[1]. There is
       | typically a focus on automation and an Infrastructure as code
       | methodology. SRE uses elements of software engineering, IT
       | infrastructure, web development, and operations[2] to assist with
       | reliability. It is similar to DevOps as they both aim to improve
       | the reliability and availability of deployed software systems.
       | 
       | https://en.wikipedia.org/wiki/Site_reliability_engineering
        
         | tetris11 wrote:
         | Thank you. Ridiculous that every other acronym was defined
         | except the one in the title..
        
         | doublerabbit wrote:
         | Modern day SysAdmin.
         | 
         | SysOp > SysAdmin > SRE
         | 
         | No different to what I've been doing for the past 15 years. Web
         | 2.0 needed a new buzzword is all.
         | 
         | SREs are System Admins who come from the development
         | background.
         | 
         | System Admins come from System Operators background.
         | 
         | Uptime is easy when people actually listen to what I have to
         | say or listen to NetOps.
         | 
         | Rather than DevOps throwing $next technology at everything or
         | "needing" 100x more X because their codebase lacks.
        
       | rzz3 wrote:
       | These days I wonder if Google is really the example to follow.
       | There was a time 10 or 15 years ago where Google seemed to be
       | leading the industry in everything, and I feel like a lot of
       | people still think they do when it comes to engineering culture.
       | These days I tend to see Google as a bit of a red flag on a
       | resume, and I have a set of questions I ask to make sure they
       | didn't drink too much of the koolaid. Perhaps more importantly,
       | when I look at Google from the outside these days, I see that
       | their products have really gone downhill in terms of quality. I
       | see Google Search riddled with spam, I see Gemini struggling to
       | keep up with OpenAI, Google Chat trying to keep up with Slack but
       | missing the mark, Nest being stagnant, I could go on and on. All
       | this to say that I don't think Google is the North Star that it
       | used to be in terms of guiding engineering culture throughout the
       | industry.
        
         | dehrmann wrote:
         | I agree on the product and customer service front, but Google's
         | reliability is top-notch.
        
           | mirashii wrote:
           | As a Google Cloud customer, I'd say it might be best to split
           | Google into some divisions or something, as Google Cloud's
           | reliability is a relative shitshow compared to Google.com.
        
         | pphysch wrote:
         | Who is then?
        
           | scarface_74 wrote:
           | From a product standpoint every BigTech company has done
           | better at releasing new products than Google.
        
             | pphysch wrote:
             | How did we get from SRE culture to (paraphrasing) "I
             | personally think Google makes worse products than IBM,
             | Oracle, Apple, Netflix, Broadcom, et al."
        
               | scarface_74 wrote:
               | Having good technology and good products are orthogonal.
               | People are conflating the two
        
               | hollowsunsets wrote:
               | What defines a good product? Something that many
               | customers use? Something that makes shareholders happy?
        
               | scarface_74 wrote:
               | A product that either moves the needle as far as revenue
               | and/or makes the ecosystem better. It also needs to be a
               | product that gets continuously better as long as there is
               | a market for it and not abandoned quickly.
               | 
               | - "a connected TV device". How many cancelled lines of
               | products have they abandoned? How many market failures
               | have they had in their own line of phones? The Pixels
               | aren't taking the world by storm and they spent billions
               | on Motorola and then sold it off for scraps
               | 
               | They have been releasing and cancelling their own tablet
               | initiatives for years.
               | 
               | At one point they had 5 separate messaging initiatives
               | going on simultaneously.
               | 
               | Even today they have three operating system initiatives
               | that are not on the same codebase - Fuchsia, Android and
               | ChromeOS.
               | 
               | They have basically abandoned Flutter and don't use it
               | for any of their high profile apps.
               | 
               | What have they actually done besides ads?
               | 
               | And the obvious evidence is their money losing "other
               | bets"
               | 
               | Also Google Fiber
               | 
               | https://www.spglobal.com/marketintelligence/en/news-
               | insights...
        
         | jofla_net wrote:
         | this came to mind
         | 
         | https://www.spiceworks.com/tech/data-management/news/google-...
        
           | sanj wrote:
           | Fixed about a week later:
           | 
           | https://support.google.com/drive/thread/245861992/drive-
           | for-...
        
             | znpy wrote:
             | it shouldn't have happened in the first place.
        
         | taeric wrote:
         | I'm curious what you have in mind for evidence of "koolaid"
         | there?
         | 
         | Hard to disagree with the general trend you are outlining.
         | Most of that feels driven by product choices, moreso than
         | execution. I think a lot of the previous glorification of their
         | work was likely misguided, as well. But I would be hard pressed
         | to be quantitative on that.
        
           | scarface_74 wrote:
           | It was 5 years after Android was introduced that the CEO
           | stopped using BlackBerry...
        
             | taeric wrote:
             | An amusingly good quantification of some evidence. Well
             | done! :D
             | 
             | Still, I don't have much to say that I think the
             | engineering was overly good or bad. I typically think that
             | what they captured for a short while, at least, was
             | enthusiasm. In particular, developers were enthusiastic to
             | be near Google technology in a way that I don't think I've
             | seen for other companies, since.
             | 
             | I don't think they identified it as such, though. Which
             | could be why they seem slow to see that a lot of that has
             | evaporated.
             | 
             | Not to say that they have no enthusiasm, now. I'd wager
             | they still have a lot. But as a percentage share of all
             | developers, it feels very different.
        
         | scarface_74 wrote:
         | I would never hire a _product_ person from Google or someone I
         | needed to be visionary. For the most part, their products suck,
         | they have no vision and no follow through.
         | 
         | But their _technology_ is top notch. I hire mostly for startups
         | and green field initiatives though and I wouldn't hire anyone
         | from any BigTech company unless I had "hard" technical problems
         | to solve.
         | 
         | Yes I've done a stint at BigTech.
        
           | ninkendo wrote:
           | They have top notch tech, yes, but it's massively overkill
           | for literally every company that's not at google's scale. If
           | you're not careful you may hire someone who will try to
           | replicate everything google does, when you may need only 1%
           | of the complexity. This is the experience I've generally had
           | with xooglers... they lament that they don't have the same
           | tools/tech stack they had at google, and so their first act
           | is to try to move everything to the closest open source
           | equivalents, even if they're not a good fit.
           | 
           | There's good things and bad things to take away from
           | experience at google... you have to be careful to ignore
           | things that won't actually help you.
        
             | scarface_74 wrote:
             | I agree. I haven't run into a "hard problem" in my career
             | 
             | By hard problem I mean one that is technically in the top
             | 5% of problems in the industry and can't be solved by
             | throwing money at a SaaS or using a cloud provider.
        
             | deepsun wrote:
             | I've been the "you're not google" person for several years,
             | but now softened my position.
             | 
             | The thing is -- it depends. Sometimes when everyone knows
             | some complex system well -- it becomes easy.
             | 
             | One example comes to mind -- Kubernetes. 90% of teams don't
             | need all its complexity. And I've been "you don't need it"
             | person for some time. But now I see that when everyone
             | knows it -- it's actually much easier to deploy even simple
             | websites on it, because it's a common lingo and you don't
             | spend time explaining how it's deployed.
             | 
             | It's not like civil engineering, where an over-engineered
             | bridge would cost a lot more in materials.
        
               | scarface_74 wrote:
               | If you have a simple website, you can containerize your
               | backend and use much simpler services from AWS and serve
               | your static assets on S3.
               | 
               | Kubernetes is rarely the right answer for simple things
               | even if Docker is.
        
               | mschuster91 wrote:
               | > much simpler services from AWS
               | 
               | Like what, Lambda? I've seen so many horrible hacks and
               | shit done with it (and other AWS services _cough_ API
               | gateway _cough_ ), these days I rather prefer a set of
               | Kubernetes descriptors and Dockerfiles.
               | 
               | At least that combination all but _enforces_ people doing
               | Infrastructure-as-code and there's (almost) no
               | possibility at all for "had to do live hack XYZ in the
               | console and forgot to document it or apply it back in
               | Terraform" .
        
               | scarface_74 wrote:
               | AWS App Runner
               | 
               | https://aws.amazon.com/blogs/containers/introducing-aws-
               | app-...
               | 
               | Google has something similar.
        
               | icedchai wrote:
               | GCP has Cloud Run, which looks similar. App Runner is
               | basically a wrapper on top of Fargate, right?
        
               | icedchai wrote:
               | In my experience, you are better off with ECS/Fargate
               | than Lambda for serving an API. You get much more
               | flexibility.
               | 
               | Also, I've witnessed people editing Lambda code through
               | the console instead of doing a real deploy. what a
               | mess...
        
           | jarsin wrote:
           | And now we are all stuck doing leetcode interviews primarily
           | because of Google.
        
             | marssaxman wrote:
             | Leetcode didn't exist back then; the site was founded a
             | little less than a decade ago.
        
         | brudgers wrote:
         | Unless you have Google sized problems and resources, Google
         | probably is not the best example because the things Google does
         | are done to address Google size problems with Google sized
         | resources. Its tooling and methods are not commercial
         | products.
         | 
         | For example, Google can get away with the flaws of its AI
         | search results because it is Google.
        
         | jeffbee wrote:
         | The fact that some people prefer ChatGPT over Gemini is not
         | something that SRE can help you with. The fact that ChatGPT is
         | rarely available is something that SRE could help Microsoft
         | avoid.
        
           | lupire wrote:
           | ChatGPT is rarely available??
        
             | jeffbee wrote:
             | They have major, long-lasting incidents at least once a
             | week. https://status.openai.com/
        
         | yodsanklai wrote:
         | > There was a time 10 or 15 years ago where Google seemed to be
         | leading the industry in everything
         | 
         | They used to write interesting books and articles about
         | software engineering. It felt that they were maintaining high
         | quality standards and were an industry reference. Nowadays, I
         | wouldn't go as far as saying it's a red flag to have Google on
         | one's resume, but definitely not the same appeal as before.
        
       | jph wrote:
       | The article describes Causal Analysis based on Systems Theory
       | (CAST) which is akin to many-factor root cause analysis.
       | 
       | I am a big fan of CAST for software teams, and of MIT Prof. Nancy
       | Leveson who leads CAST.
       | 
       | My CAST summary notes for tech teams:
       | 
       | https://github.com/joelparkerhenderson/causal-analysis-based...
       | 
       | MIT CAST Handbook:
       | 
       | http://sunnyday.mit.edu/CAST-Handbook.pdf
        
         | pulkitsh1234 wrote:
         | Are there any resources to show how to apply this in practice?
         | This is too theoretical to grok for me, there are too many
         | terms. It seems too time-consuming to understand (and to
         | perform IMO)
        
           | jph wrote:
           | > This is too theoretical to grok for me
           | 
           | Here's a fast, easy, practical way to think about CAST:
           | 
           | 1. Causal: Novices may believe accidents are due to one "root
           | cause" or a few "probable causes", but it turns out that
           | accidents are actually due to many interacting causes.
           | 
           | 2. Analysis: Novices may blame people, but it's smarter to do
           | blame-free examination of why the loss occurred, and how it
           | occurred i.e. "ask why and how, not who".
           | 
           | 3. Systems: Novices may fix just one thing that broke, but it
           | turns out it's better to discover multiple causes, then plan
           | multiple ways to improve the whole system.
        
         | materielle wrote:
         | I was listening to a Titus Winters podcast, and I'm not sure he
         | exactly put it like this, but I took it away as:
         | 
         | There are two problems with automated testing: 1) tests take
         | too long to run, and 2) breakages are difficult to root-cause.
         | 
         | Most devs solve this by making unit tests ever more granular,
         | with heavy use of mocks/fakes. This "solves" both problems in a
         | narrow sense: the tests run faster and the cause of a breakage
         | is obvious.
         | 
         | But you didn't actually solve the problem. Since the entire
         | point of writing tests in the first place was to answer the
         | question: "does my system work"? Granular and mocked unit tests
         | don't help much.
         | 
         | However, going back to the original question, we can actually
         | reframe the problems as: 1) a work scheduling problem and 2) a
         | signal processing problem.
         | 
         | Those are pretty well understood problems with good solutions.
         | It's just that this is a somewhat novel way of thinking of
         | tests, so it hasn't really been integrated into the open source
         | tool chain.
         | 
         | You could imagine integration tests automatically being
         | correlated to a microservice release. Some CI automation
         | constantly running expensive tests over a range of commits and
         | automatically bisecting on failure. Etc.
         | 
         | Put another way, automated tests don't go far enough. We need
         | yet another higher layer of abstraction. Computers are better
         | at deciding what tests to run and when, and are also better at
         | interpreting the results.
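
A minimal sketch of the "automatically bisecting on failure" idea from the comment above, assuming commits are ordered oldest to newest and that once the expensive test starts failing it keeps failing (no flaky results). The names `first_bad_commit` and `passes` are hypothetical and do not come from any CI tool mentioned in the thread.

```python
from typing import Callable, Optional, Sequence


def first_bad_commit(
    commits: Sequence[str],
    passes: Callable[[str], bool],
) -> Optional[str]:
    """Return the earliest commit for which `passes` is False, or None.

    Assumption: commits are ordered oldest to newest, and every commit
    after the first failing one also fails.
    """
    if not commits or passes(commits[-1]):
        return None  # nothing to bisect: the newest commit is healthy
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] fails
    while lo < hi:
        mid = (lo + hi) // 2
        if passes(commits[mid]):
            lo = mid + 1  # breakage was introduced after mid
        else:
            hi = mid  # mid fails, so the culprit is mid or earlier
    return commits[hi]
```

With eight commits and an expensive test, this runs the test at most four times instead of eight, which is the scheduling saving the comment is pointing at.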
        
           | azurelake wrote:
           | > Put another way, automated tests don't go far enough. We
           | need yet another higher layer of abstraction. Computers are
           | better at deciding what tests to run and when, and are also
           | better at interpreting the results.
           | 
           | Sounds like you might be interested in
           | https://antithesis.com/ (no affiliation).
        
       | MPSimmons wrote:
       | This reminds me very much of Sidney Dekker's work, particularly
       | The Field Guide to Understanding Human Failure, and Drift Into
       | Failure.
       | 
       | The former focuses on evaluating the system as a whole, and
       | identifying the state of mind of the participants of the
       | accidents and evaluating what led them to believe that they were
       | making the correct decisions, with the understanding that nobody
       | wants to crash a plane.
       | 
       | The latter book talks more about how multiple seemingly
       | independent changes to complex loosely coupled systems can
       | introduce gaps in safety coverage that aren't immediately
       | obvious, and how those things could be avoided.
       | 
       | I think the CAST approach looks appealing. It seems as though it
       | does require a lot of analysis of failures and near-misses to be
       | best utilized, and the hardest part of implementing it will
       | undoubtedly be the people, who often take the "there wasn't a
       | failure, why should we spend time and energy investigating a
       | success" mindset.
        
       | FuriouslyAdrift wrote:
       | I think the single biggest thing about Google SREs (at least in
       | the early years) was that if your team was going to launch a new
       | product, you had to have an SRE to help and to maintain the
       | service.
       | 
       | Google deliberately limited the number of SREs, so you had to
       | prove your stuff worked and sell it to the SRE to even get a
       | chance to launch.
       | 
       | Constraints help to make good ideas better...
        
         | hollowsunsets wrote:
         | It's not good when you have an SRE on hand to act as a
         | babysitter of sorts. That is how some companies use SREs these
         | days. They do the toil and sysadmin work so the product
         | engineers can focus on features. Exactly what we hoped to
         | avoid, but here we are.
        
         | arthurjj wrote:
         | Thanks for this detail, I worked at Google, with SREs, and
         | didn't know it. It seems like the type of 'design' detail that
         | might be more important than this entire article
        
         | emtel wrote:
         | This culture was, imo, directly responsible for google's
         | failure to launch a facebook competitor early enough for it to
         | matter.
         | 
         | The Orkut project was basically banned from being launched or
         | marketed as an official google product because it was deemed
         | "not production ready" by SRE. Despite that it gained huge
         | market share in Brazil and a few other countries before
         | eventually losing to FB. By the time their "production ready"
         | product (G+) launched it was hilariously late.
         | 
         | Facebook probably would have won anyway, but who knows what
         | might have happened if Google had actually leaned into this
         | very successful project instead of treating it like an unwanted
         | step-child.
        
       | crabbone wrote:
       | I wish this article was at most a quarter of its current length.
       | Preferably even shorter. There's so much self-congratulatory and
       | empty talk, it's really hard to get to the main point.
       | 
       | I think, the most important (and actually valuable) part is the
       | mention of the work done by someone else (STPA and CAST). That's
       | all there is to the article. Read about Causal Analysis based on
       | Systems Theory (CAST) and System-Theoretic Process Analysis
       | (STPA), and do what the book says.
        
         | anal_reactor wrote:
         | Agreed that the whole article could've been much shorter.
         | Anyway, for me the key takeaway is not to trust your inputs.
         | It's true that code correctness often boils down to "given
         | input X, the program will correctly give output Y", but the
         | actual issue is that sometimes the input X itself might be
         | wrong. I think it's clearly visible in project management,
         | where people tell you one thing, you plan accordingly, then
         | later they do another thing, and if you haven't predicted this,
         | you're done. If this behavior is so common in human projects in
         | general, I see no reason why it wouldn't emerge in software
         | projects too.
         | 
         | The problem is, software that tries to do something smart with
         | inputs is much harder to reason about, which in turn increases
         | your likelihood of failure, which is exactly the thing you
         | wanted to avoid in the first place. For example, imagine you
         | have an edge case in your script where you want to perform "rm
         | -rf /" but the safety mechanism prevents you from doing this,
         | which effectively makes your script fail.
         | 
         | In conclusion, in my humble opinion, the most important part of
         | safety is choosing tools that are simplest to reason about. If
         | you have a bash script you're guaranteed to have some bug
         | related to some edge case - people managing POSIX realized that
         | bash is so fundamentally broken that it's better to forbid
         | certain filenames rather than fix bash. Use a Python library
         | for 10x the safety but half the comfort. If you have a C++
         | program it will leak memory no matter how hard you try. And so
         | on.
         | 
         | Similarly, when writing programs, you should give simple and
         | strong promises about its API. Don't ever do "program accepts
         | most sensible date strings and tries to parse that", do "it's
         | either this specific format or an error".
         | 
         | Verifying inputs and being smart about them is a good idea that
         | should be used carefully because it can backfire spectacularly.
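
A minimal sketch of the "it's either this specific format or an error" promise from the comment above, assuming the single accepted format is zero-padded YYYY-MM-DD (the thread does not name a format). `parse_iso_date` is a hypothetical name.

```python
from datetime import date, datetime


def parse_iso_date(text: str) -> date:
    """Accept exactly YYYY-MM-DD; raise ValueError for anything else."""
    # strptime tolerates some variation (for example "2025-1-3"), so
    # round-trip the result and reject anything that is not the
    # canonical zero-padded form.
    parsed = datetime.strptime(text, "%Y-%m-%d").date()
    if parsed.isoformat() != text:
        raise ValueError(f"expected YYYY-MM-DD, got {text!r}")
    return parsed
```

Callers get one of two outcomes, a `date` or a `ValueError`, so there is no guessing about how an ambiguous string such as "03/01/2025" was interpreted.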
        
       ___________________________________________________________________
       (page generated 2025-01-03 23:00 UTC)