[HN Gopher] The Evolution of SRE at Google
___________________________________________________________________
The Evolution of SRE at Google
Author : r4um
Score : 101 points
Date : 2025-01-03 11:38 UTC (11 hours ago)
(HTM) web link (www.usenix.org)
(TXT) w3m dump (www.usenix.org)
| 0xbadcafebee wrote:
| They're doing that thing that happened to DevOps. It started out
| as a guy who wanted a way for devs and sysadmins to talk about
| deploys together, so they didn't get dead-cat syndrome. It ended
| up as an entire branch of business management theory,
| consultants, and a whole lot of ignorant people who think it just
| means "a dev who does sysadmin tasks".
|
| Abuse of a single word to mean too many things makes it
| meaningless. SRE now has that distinction. You've got SREs who
| (mostly) write software, SREs who (mostly) stare at graphs and
| deal with random system failures, and now SREs who use a
| framework to develop a complex risk model of multiple systems
| (which is more quality control than engineering).
| inquist wrote:
| I think failure mode analysis is definitely part of engineering
| ImPostingOnHN wrote:
| Without resorting to any "big-D Devops" definition, I have
| almost always seen devops referring to "supporting / running
| the code you write", and have never encountered the definition
| where dev and ops were 2 different roles. That was what things
| were like before devops, and coordination on product support
| and planning wasn't great, hence devops.
| moandcompany wrote:
| A lot of organizations simply renamed the functional area of
| "systems administration" or "systems engineering" to
| "DevOps," and at many of these places, "DevOps" is the new
| name for the group that software developers will throw stuff
| over the fence to.
|
| The issue with the above names is that they can be applied to
| a domain or area of practices, or an organizational boundary.
| In a non-trivial number of organizations, "DevOps" is viewed
| as a support entity for one or more software development
| teams, versus software development teams practicing "devops."
|
| This applies to many of the *_Ops names in fashion during the
| past five years or so.
| znpy wrote:
| After almost 10 years in the systems engineering /
| administration / devops / cloud etc space all I can say is:
|
| The biggest improvement that devops brought is that it made
| managers feel dumb, outdated and scared because they were
| not "doing devops" while everybody else was, so they kinda
| started listening to sysadmins and what they had to say.
|
| Uh, devops engineers did not come out of nowhere. They did
| not come out of the ground like mushrooms. Most if not all
| the "devops engineers" I know are just former sysadmins.
| They were already willing to do whatever devops was
| supposed to be, it's just that they were largely ignored.
|
| Writing this I just realized that maybe the best way to
| obtain organizational change is to make management and
| upper management feel stupid and outdated. Interesting.
| steveBK123 wrote:
| Hence they have all hired a "head of AI" in the last 18 months.
| stackskipton wrote:
| I'm an Ops type of person, so I work at companies where there is
| a split between the two. Ops is a skill, and frankly a mindset,
| that not all developers have, so you will need a team/person to
| do it. Generally companies don't like the cost of embedding an
| Ops person into every team, and that can create redundant work,
| so they form a DevOps/SRE team.
|
| Good resource for different types of teams is here:
| https://web.devopstopologies.com/
| 0xbadcafebee wrote:
| The reason you never encountered the second definition is
| two-fold:
|
| 1) There is no formal academic education behind the concept
| (that I'm aware of). If you do a CS major, nobody's going to
| explain to you the accumulated 15 years of practice and
| knowledge around the concept.
|
| 2) Due to 1), people just repeat what other people tell them.
| It's like a long game of telephone. It turns out most
| software development today is just a game of telephone
| between devs (and now AI). So almost everyone is misinformed.
|
| The Wikipedia page for DevOps is the best generic starting
| point if you want to know more.
|
| If you want to know more after that, there are a number of
| books and blog posts. Jez Humble, John Willis, Gene Kim,
| Patrick Debois, etc are the people to read. It's a much
| larger body of knowledge than you might think. Almost none of
| it has to do with devs supporting/running what they write
| (that's a small subset of a larger category, and there's
| multiple categories of 'stuff')
| dilyevsky wrote:
| > You've got SREs who (mostly) write software, SREs who
| (mostly) stare at graphs and deal with random system failures,
| and now SREs who use a framework to develop a complex risk
| model of multiple systems (which is more quality control than
| engineering).
|
| This was always the case, or at least going back 15 years or
| more, as highlighted by the so-called "treynor curve".
| 01HNNWZ0MV43FF wrote:
| Never heard of dead-cat syndrome. In case anyone else wonders:
|
| > There is one thing that is absolutely certain about throwing
| a dead cat on the dining room table - and I don't mean that
| people will be outraged, alarmed, disgusted. That is true, but
| irrelevant. The key point, says my Australian friend, is that
| everyone will shout, "Jeez, mate, there's a dead cat on the
| table!" In other words, they will be talking about the dead cat
| - the thing you want them to talk about - and they will not be
| talking about the issue that has been causing you so much grief
|
| https://en.wikipedia.org/wiki/Dead_cat_strategy
| johnkpaul wrote:
| I actually don't think that's the dead-cat-saying that the
| parent is referencing. I think that it's this concept
| http://itskeptic.org/dead-cat-syndrome.html
|
| I am also unfamiliar though and I'm reading up on it right
| now.
| 0xbadcafebee wrote:
| That's the one. Only old fogies (like me) know it I guess.
| It was the thing we all referred to as the impetus behind
| DevOps, when it became a thing a decade ago.
| qwertox wrote:
| SRE == Site Reliability Engineering.
|
| Quoting Wikipedia:
|
| Site Reliability Engineering (SRE) is a discipline in the field
| of Software Engineering that monitors and improves the
| availability and performance of deployed software systems, often
| large software services that are expected to deliver reliable
| response times across events such as new software deployments,
| hardware failures, and cybersecurity attacks[1]. There is
| typically a focus on automation and an Infrastructure as code
| methodology. SRE uses elements of software engineering, IT
| infrastructure, web development, and operations[2] to assist with
| reliability. It is similar to DevOps as they both aim to improve
| the reliability and availability of deployed software systems.
|
| https://en.wikipedia.org/wiki/Site_reliability_engineering
| tetris11 wrote:
| Thank you. Ridiculous that every other acronym was defined
| except the one in the title..
| doublerabbit wrote:
| Modern day SysAdmin.
|
| SysOp > SysAdmin > SRE
|
| No different to what I've been doing for the past 15 years. Web
| 2.0 needed a new buzzword is all.
|
| SREs are System Admins who come from the development
| background.
|
| System Admins come from System Operators background.
|
| Uptime is easy when people actually listen to what I have to
| say or listen to NetOps.
|
| Rather than DevOps throwing $next technology at everything or
| "needing" 100x more X because their codebase lacks.
| rzz3 wrote:
| These days I wonder if Google is really the example to follow.
| There was a time 10 or 15 years ago where Google seemed to be
| leading the industry in everything, and I feel like a lot of
| people still think they do when it comes to engineering culture.
| These days I tend to see Google as a bit of a red flag on a
| resume, and I have a set of questions I ask to make sure they
| didn't drink too much of the koolaid. Perhaps more importantly,
| when I look at Google from the outside these days, I see that
| their products have really gone downhill in terms of quality. I
| see Google Search riddled with spam, I see Gemini struggling to
| keep up with OpenAI, Google Chat trying to keep up with Slack but
| missing the mark, Nest being stagnant, I could go on and on. All
| this to say that I don't think Google is the North Star that it
| used to be in terms of guiding engineering culture throughout the
| industry.
| dehrmann wrote:
| I agree on the product and customer service front, but Google's
| reliability is top-notch.
| mirashii wrote:
| As a Google Cloud customer, I'd say it might be best to split
| Google into some divisions or something, as Google Cloud's
| reliability is a relative shitshow compared to Google.com.
| pphysch wrote:
| Who is then?
| scarface_74 wrote:
| From a product standpoint every BigTech company has done
| better at releasing new products than Google.
| pphysch wrote:
| How did we get from SRE culture to (paraphrasing) "I
| personally think Google makes worse products than IBM,
| Oracle, Apple, Netflix, Broadcom, et al."
| scarface_74 wrote:
| Having good technology and good products are orthogonal.
| People are conflating the two
| hollowsunsets wrote:
| What defines a good product? Something that many
| customers use? Something that makes shareholders happy?
| scarface_74 wrote:
| A product that either moves the needle as far as revenue
| and/or makes the ecosystem better. It also needs to be a
| product that gets continuously better as long as there is
| a market for it and not abandoned quickly.
|
| - "a connected TV device". How many cancelled lines of
| products have they abandoned? How many market failures
| have they had in their own line of phones? The Pixels
| aren't taking the world by storm and they spent billions
| on Motorola and then sold it off for scraps
|
| They have been releasing and cancelling their own tablet
| initiatives for years.
|
| At one point they had 5 separate messaging initiatives
| going on simultaneously.
|
| Even today they have three operating system initiatives
| that are not on the same codebase - Fuchsia, Android and
| ChromeOS.
|
| They have basically abandoned Flutter and don't use it
| for any of their high profile apps.
|
| What have they actually done besides ads?
|
| And the obvious evidence is their money losing "other
| bets"
|
| Also Google Fiber
|
| https://www.spglobal.com/marketintelligence/en/news-
| insights...
| jofla_net wrote:
| this came to mind
|
| https://www.spiceworks.com/tech/data-management/news/google-...
| sanj wrote:
| Fixed about a week later:
|
| https://support.google.com/drive/thread/245861992/drive-
| for-...
| znpy wrote:
| it shouldn't have happened in the first place.
| taeric wrote:
| I'm curious what you have in mind for evidence of "koolaid"
| there?
|
| Hard to disagree with the general trend you are outlining.
| Most of that feels driven by product choices, moreso than
| execution. I think a lot of the previous glorification of their
| work was likely misguided, as well. But I would be hard pressed
| to be quantitative on that.
| scarface_74 wrote:
| It was 5 years after Android was introduced that the CEO
| stopped using BlackBerry...
| taeric wrote:
| An amusingly good quantification of some evidence. Well
| done! :D
|
| Still, I don't have much to say that I think the
| engineering was overly good or bad. I typically think that
| what they captured for a short while, at least, was
| enthusiasm. In particular, developers were enthusiastic to
| be near Google technology in a way that I don't think I've
| seen for other companies, since.
|
| I don't think they identified it as such, though. Which
| could be why they seem slow to see that a lot of that has
| evaporated.
|
| Not to say that they have no enthusiasm, now. I'd wager
| they still have a lot. But as a percentage share of all
| developers, it feels very different.
| scarface_74 wrote:
| I would never hire a _product_ person from Google or someone I
| needed to be visionary. For the most part, their products suck,
| they have no vision and no follow through.
|
| But their _technology_ is top notch. I hire mostly for startups
| and green field initiatives though and I wouldn't hire anyone
| from any BigTech company unless I had "hard" technical problems
| to solve.
|
| Yes I've done a stint at BigTech.
| ninkendo wrote:
| They have top notch tech, yes, but it's massively overkill
| for literally every company that's not at google's scale. If
| you're not careful you may hire someone who will try to
| replicate everything google does, when you may need only 1%
| of the complexity. This is the experience I've generally had
| with xooglers... they lament that they don't have the same
| tools/tech stack they had at google, and so their first act
| is to try to move everything to the closest open source
| equivalents, even if they're not a good fit.
|
| There's good things and bad things to take away from
| experience at google... you have to be careful to ignore
| things that won't actually help you.
| scarface_74 wrote:
| I agree. I haven't run into a "hard problem" in my career
|
| By hard problem I mean technically in the top 5% of
| problems in the industry that can't be solved by throwing
| money at a SaaS or using a cloud provider.
| deepsun wrote:
| I've been the "you're not google" person for several years,
| but now softened my position.
|
| The thing is -- it depends. Sometimes when everyone knows
| some complex system well -- it becomes easy.
|
| One example comes to mind -- Kubernetes. 90% of teams don't
| need all its complexity. And I've been "you don't need it"
| person for some time. But now I see that when everyone
| knows it -- it's actually much easier to deploy even simple
| websites on it, because it's a common lingo and you don't
| spend time explaining how it's deployed.
|
| It's not like civil engineering, where an over-engineered
| bridge would cost a lot more in materials.
| scarface_74 wrote:
| If you have a simple website, you can containerize your
| backend and use much simpler services from AWS and serve
| your static assets on S3.
|
| Kubernetes is rarely the right answer for simple things
| even if Docker is.
| mschuster91 wrote:
| > much simpler services from AWS
|
| Like what, Lambda? I've seen so many horrible hacks and
| shit done with it (and other AWS services, _cough_ API
| gateway _cough_), these days I rather prefer a set of
| Kubernetes descriptors and Dockerfiles.
|
| At least that combination all but _enforces_ people doing
| infrastructure as code, and there's (almost) no
| possibility at all for "had to do live hack XYZ in the
| console and forgot to document it or apply it back in
| Terraform".
| scarface_74 wrote:
| AWS App Runner
|
| https://aws.amazon.com/blogs/containers/introducing-aws-
| app-...
|
| Google has something similar.
| icedchai wrote:
| GCP has Cloud Run, which looks similar. App Runner is
| basically a wrapper on top of Fargate, right?
| icedchai wrote:
| In my experience, you are better off with ECS/Fargate
| than Lambda for serving an API. You get much more
| flexibility.
|
| Also, I've witnessed people editing Lambda code through
| the console instead of doing a real deploy. What a
| mess...
| jarsin wrote:
| And now we are all stuck doing leetcode interviews primarily
| because of Google.
| marssaxman wrote:
| Leetcode didn't exist back then; the site was founded a
| little less than a decade ago.
| brudgers wrote:
| Unless you have Google sized problems and resources, Google
| probably is not the best example because the things Google does
| are done to address Google size problems with Google sized
| resources. Its tooling and methods are not commercial
| products.
|
| For example, Google can get away with the flaws of its AI
| search results because it is Google.
| jeffbee wrote:
| The fact that some people prefer ChatGPT over Gemini is not
| something that SRE can help you with. The fact that ChatGPT is
| rarely available is something that SRE could help Microsoft
| avoid.
| lupire wrote:
| ChatGPT is rarely available??
| jeffbee wrote:
| They have major, long-lasting incidents at least once a
| week. https://status.openai.com/
| yodsanklai wrote:
| > There was a time 10 or 15 years ago where Google seemed to be
| leading the industry in everything
|
| They used to write interesting books and articles about
| software engineering. It felt that they were maintaining high
| quality standards and were an industry reference. Nowadays, I
| wouldn't go as far as saying it's a red flag to have Google on
| one's resume, but definitely not the same appeal as before.
| jph wrote:
| The article describes Causal Analysis based on Systems Theory
| (CAST) which is akin to many-factor root cause analysis.
|
| I am a big fan of CAST for software teams, and of MIT Prof. Nancy
| Leveson who leads CAST.
|
| My CAST summary notes for tech teams:
|
| https://github.com/joelparkerhenderson/causal-analysis-based...
|
| MIT CAST Handbook:
|
| http://sunnyday.mit.edu/CAST-Handbook.pdf
| pulkitsh1234 wrote:
| Are there any resources to show how to apply this in practice?
| This is too theoretical to grok for me, there are too many
| terms. It seems too time-consuming to understand (and to
| perform IMO)
| jph wrote:
| > This is too theoretical to grok for me
|
| Here's a fast, easy, practical way to think about CAST:
|
| 1. Causal: Novices may believe accidents are due to one "root
| cause" or a few "probable causes", but it turns out that
| accidents are actually due to many interacting causes.
|
| 2. Analysis: Novices may blame people, but it's smarter to do
| blame-free examination of why the loss occurred, and how it
| occurred i.e. "ask why and how, not who".
|
| 3. Systems: Novices may fix just one thing that broke, but it
| turns out it's better to discover multiple causes, then plan
| multiple ways to improve the whole system.
| materielle wrote:
| I was listening to a Titus Winters podcast, and I'm not sure he
| exactly put it like this, but I took it away as:
|
| There are two problems with automated testing. 1) tests take
| too long to run 2) difficult to root cause breakages.
|
| Most devs solve this by making unit tests ever more granular
| with heavy use of mocks/fakes. This "solves" both problems in a
| narrow sense: the tests run faster and are obvious to root
| cause breakages.
|
| But you didn't actually solve the problem. Since the entire
| point of writing tests in the first place was to answer the
| question: "does my system work"? Granular and mocked unit tests
| don't help much.
|
| However, going back to the original question, we can actually
| reframe the problems as: 1) a work scheduling problem and 2) a
| signal processing problem.
|
| Those are pretty well understood problems with good solutions.
| It's just that this is a somewhat novel way of thinking of
| tests, so it hasn't really been integrated into the open source
| tool chain.
|
| You could imagine integration tests automatically being
| correlated to a micro service release. Some CI automation
| constantly running expensive tests over a range of commits
| and automatically bisecting on failure. Etc.
|
| Put another way, automated tests don't go far enough. We need
| yet another higher layer of abstraction. Computers are better
| at deciding what tests to run and when, and are also better at
| interpreting the results.
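|
| A minimal sketch of the "bisecting on failure" idea, in Python.
| The names first_bad_commit and passes are hypothetical, not
| from any real CI tool. It assumes the commits are ordered oldest
| to newest and that once a commit breaks the expensive test,
| every later commit fails too (the same assumption git bisect
| makes).

```python
from typing import Callable, Optional, Sequence


def first_bad_commit(
    commits: Sequence[str],
    passes: Callable[[str], bool],
) -> Optional[str]:
    """Return the earliest commit whose test run fails, or None if none fail.

    `passes` is a hypothetical hook that runs the expensive test suite
    against one commit and reports whether it succeeded.
    """
    # Nothing to bisect if the range is empty or the newest commit is green.
    if not commits or passes(commits[-1]):
        return None

    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is known to fail
    while lo < hi:
        mid = (lo + hi) // 2
        if passes(commits[mid]):
            lo = mid + 1  # the breakage was introduced after mid
        else:
            hi = mid      # mid is already broken; look at or before it
    return commits[lo]


if __name__ == "__main__":
    # Simulated history in which the breakage lands at commit "c11".
    history = [f"c{i}" for i in range(16)]
    print(first_bad_commit(history, lambda c: int(c[1:]) < 11))
```

| The point of the sketch is the cost: the expensive suite runs
| on roughly log2(N) commits instead of all N, which is what
| makes it plausible to skip it on most commits and only bisect
| when the newest one goes red. Flaky tests break the "fails
| forever after" assumption and would need retries on top.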
| azurelake wrote:
| > Put another way, automated tests don't go far enough. We
| need yet another higher layer of abstraction. Computers are
| better at deciding what tests to run and when, and are also
| better at interpreting the results.
|
| Sounds like you might be interested in
| https://antithesis.com/ (no affiliation).
| MPSimmons wrote:
| This reminds me very much of Sidney Dekker's work, particularly
| The Field Guide to Understanding 'Human Error' and Drift into
| Failure.
|
| The former focuses on evaluating the system as a whole, and
| identifying the state of mind of the participants of the
| accidents and evaluating what led them to believe that they were
| making the correct decisions, with the understanding that nobody
| wants to crash a plane.
|
| The latter book talks more about how multiple seemingly
| independent changes to complex loosely coupled systems can
| introduce gaps in safety coverage that aren't immediately
| obvious, and how those things could be avoided.
|
| I think the CAST approach looks appealing. It seems as though it
| does require a lot of analysis of failures and near-misses to be
| best utilized, and the hardest part of implementing it will
| undoubtedly be the people, who often take the "there wasn't a
| failure, why should we spend time and energy investigating a
| success" mindset.
| FuriouslyAdrift wrote:
| I think the single biggest thing about Google SREs (at least in
| the early years) was that if your team was going to launch a new
| product, you had to have an SRE to help and to maintain the
| service.
|
| Google deliberately limited the number of SREs, so you had to
| prove your stuff worked and sell it to the SRE to even get a
| chance to launch.
|
| Constraints help to make good ideas better...
| hollowsunsets wrote:
| It's not good when you have an SRE on hand to act as a
| babysitter of sorts. That is how some companies use SREs these
| days. They do the toil and sysadmin work so the product
| engineers can focus on features. Exactly what we hoped to
| avoid, but here we are.
| arthurjj wrote:
| Thanks for this detail, I worked at Google, with SREs, and
| didn't know it. It seems like the type of 'design' detail that
| might be more important than this entire article
| emtel wrote:
| This culture was, imo, directly responsible for google's
| failure to launch a facebook competitor early enough for it to
| matter.
|
| The Orkut project was basically banned from being launched or
| marketed as an official google product because it was deemed
| "not production ready" by SRE. Despite that it gained huge
| market share in Brazil and a few other countries before
| eventually losing to FB. By the time their "production ready"
| product (G+) launched it was hilariously late.
|
| Facebook probably would have won anyway, but who knows what
| might have happened if Google had actually leaned into this
| very successful project instead of treating it like an unwanted
| step-child.
| crabbone wrote:
| I wish this article was at most a quarter of its current length.
| Preferably even shorter. There's so much self-congratulatory and
| empty talk, it's really hard to get to the main point.
|
| I think, the most important (and actually valuable) part is the
| mention of the work done by someone else (STPA and CAST). That's
| all there is to the article. Read about Causal Analysis based on
| Systems Theory (CAST) and System-Theoretic Process Analysis
| (STPA), and do what the book says.
| anal_reactor wrote:
| Agreed that the whole article could've been much shorter.
| Anyway, for me the key takeaway is not to trust your inputs.
| It's true that code correctness often boils down to "given
| input X, the program will correctly give output Y", but the
| actual issue is that sometimes the input X itself might be
| wrong. I think it's clearly visible in project management,
| where people tell you one thing, you plan accordingly, then
| later they do another thing, and if you haven't predicted this,
| you're done. If this behavior is so common in human projects in
| general, I see no reason why it wouldn't emerge in software
| projects too.
|
| The problem is, software that tries to do something smart with
| inputs is much harder to reason about, which in turn increases
| your likelihood of failure, which is exactly the thing you
| wanted to avoid in the first place. For example, imagine you
| have an edge case in your script where you want to perform "rm
| -rf /" but the safety mechanism prevents you from doing this,
| which effectively makes your script fail.
|
| In conclusion, in my humble opinion, the most important part of
| safety is choosing tools that are simplest to reason about. If
| you have a bash script you're guaranteed to have some bug
| related to some edge case - people managing POSIX realized that
| bash is so fundamentally broken that it's better to forbid
| certain filenames rather than fix bash. Use a Python library
| for 10x the safety but half the comfort. If you have a C++
| program it will leak memory no matter how hard you try. And so
| on.
|
| Similarly, when writing programs, you should give simple and
| strong promises about its API. Don't ever do "program accepts
| most sensible date strings and tries to parse that", do "it's
| either this specific format or an error".
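|
| A minimal sketch of "this specific format or an error" in
| Python. The function name parse_date_strict and the choice of
| YYYY-MM-DD as the one accepted format are assumptions made for
| illustration.

```python
from datetime import date, datetime

# Assumed single accepted format: zero-padded ISO calendar date, e.g. 2025-01-03.
DATE_FORMAT = "%Y-%m-%d"


def parse_date_strict(text: str) -> date:
    """Parse exactly YYYY-MM-DD or raise ValueError; never guess."""
    # strptime raises ValueError on other layouts and on impossible dates.
    parsed = datetime.strptime(text, DATE_FORMAT).date()
    # strptime still tolerates unpadded fields such as "2025-1-3", so
    # require that the input round-trips to the canonical spelling.
    if parsed.isoformat() != text:
        raise ValueError(f"expected YYYY-MM-DD, got {text!r}")
    return parsed


if __name__ == "__main__":
    print(parse_date_strict("2025-01-03"))
    for bad in ["2025-1-3", "03/01/2025", "Jan 3, 2025", "2025-02-30"]:
        try:
            parse_date_strict(bad)
        except ValueError as err:
            print(f"rejected {bad!r}: {err}")
```

| The round-trip check is the part that keeps the promise simple:
| a string is either the canonical spelling of a real date or it
| is an error, with no third "close enough" category to reason
| about.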
|
| Verifying inputs and being smart about them is a good idea that
| should be used carefully because it can backfire spectacularly.
___________________________________________________________________
(page generated 2025-01-03 23:00 UTC)