[HN Gopher] The Evolution of SRE at Google
___________________________________________________________________
The Evolution of SRE at Google
Author : r4um
Score : 277 points
Date : 2025-01-03 11:38 UTC (1 days ago)
(HTM) web link (www.usenix.org)
(TXT) w3m dump (www.usenix.org)
| 0xbadcafebee wrote:
| They're doing that thing that happened to DevOps. It started out
| as a guy who wanted a way for devs and sysadmins to talk about
| deploys together, so they didn't get dead-cat syndrome. It ended
| up as an entire branch of business management theory,
| consultants, and a whole lot of ignorant people who think it just
| means "a dev who does sysadmin tasks".
|
| Abuse of a single word to mean too many things makes it
| meaningless. SRE now has that distinction. You've got SREs who
| (mostly) write software, SREs who (mostly) stare at graphs and
| deal with random system failures, and now SREs who use a
| framework to develop a complex risk model of multiple systems
| (which is more quality control than engineering).
| inquist wrote:
| I think failure mode analysis is definitely part of engineering
| ImPostingOnHN wrote:
| Without resorting to any "big-D Devops" definition, I have
| almost always seen devops referring to "supporting / running
| the code you write", and have never encountered the definition
| where dev and ops were 2 different roles. That was what things
| were like before devops, and coordination on product support
| and planning wasn't great, hence devops.
| moandcompany wrote:
| A lot of organizations simply renamed the functional area of
| "systems administration" or "systems engineering" to
| "DevOps," and at many of these places, "DevOps" is the new
| name for the group that software developers will throw stuff
| over the fence to.
|
| The issue with the above names is that they can be applied to
| a domain or area of practices, or an organizational boundary.
| In a non-trivial number of organizations, "DevOps" is viewed
| as a support entity for one or more software development
| teams, versus software development teams practicing "devops."
|
| This applies to many of the *_Ops names in fashion during the
| past five years or so.
| znpy wrote:
| After almost 10 years in the systems engineering /
| administration / devops / cloud etc space all I can say is:
|
| The biggest improvement that devops brought is that it made
| managers feel dumb, outdated and scared because they were
| not "doing devops" while everybody else was, so they kinda
| started listening to sysadmins and what they had to say.
|
| Uh, devops engineers did not come out of nowhere. They did
| not come out of the ground like mushrooms. Most if not all
| the "devops engineers" i know are just former sysadmins.
| They were already willing to do whatever devops was
| supposed to be, it's just that they were largely ignored.
|
| Writing this I just realized that maybe the best way to
| obtain organizational change is to make management and
| upper management feel stupid and outdated. Interesting.
| steveBK123 wrote:
| Hence they have all hired a "head of AI" in the last 18 months.
| moandcompany wrote:
| "What is your AI strategy?" -> "We're hiring a Head of
| AI"
| steveBK123 wrote:
| It's the 2020s version of the old IBM line... No one ever
| got fired for hiring a head of AI.
| stackskipton wrote:
| I'm an Ops type person, so I work at companies where there is a
| split between the two. Ops is a skill not all developers have,
| or frankly even the mindset to do properly, so you will need
| a team/person to do it. Generally companies don't like the
| cost of embedding an Ops person into every team, and that can
| create redundant work, so they form a DevOps/SRE team.
|
| Good resource for different types of teams is here:
| https://web.devopstopologies.com/
| 0xbadcafebee wrote:
| The reason you never encountered the second definition is
| two-fold:
|
| 1) There is no formal academic education behind the concept
| (that I'm aware of). If you do a CS major, nobody's going to
| explain to you the accumulated 15 years of practice and
| knowledge around the concept.
|
| 2) Due to 1), people just repeat what other people tell them.
| It's like a long game of telephone. It turns out most
| software development today is just a game of telephone
| between devs (and now AI). So almost everyone is misinformed.
|
| The Wikipedia page for DevOps is the best generic starting
| point if you want to know more.
|
| If you want to know more after that, there are a number of
| books and blog posts. Jez Humble, John Willis, Gene Kim,
| Patrick Debois, etc are the people to read. It's a much
| larger body of knowledge than you might think. Almost none of
| it has to do with devs supporting/running what they write
| (that's a small subset of a larger category, and there's
| multiple categories of 'stuff')
| dilyevsky wrote:
| > You've got SREs who (mostly) write software, SREs who
| (mostly) stare at graphs and deal with random system failures,
| and now SREs who use a framework to develop a complex risk
| model of multiple systems (which is more quality control than
| engineering).
|
| This was always the case, or at least going back 15 years or
| more, as highlighted by the so-called "Treynor curve".
| pmb wrote:
| NB: The Treynor Curve is named after Ben Treynor and his
| ideas. Ben Treynor's name changed to Ben Sloss a few years
| back, and Ben Sloss is one of the authors of this article.
| 01HNNWZ0MV43FF wrote:
| Never heard of dead-cat syndrome. In case anyone else wonders:
|
| > There is one thing that is absolutely certain about throwing
| a dead cat on the dining room table - and I don't mean that
| people will be outraged, alarmed, disgusted. That is true, but
| irrelevant. The key point, says my Australian friend, is that
| everyone will shout, "Jeez, mate, there's a dead cat on the
| table!" In other words, they will be talking about the dead cat
| - the thing you want them to talk about - and they will not be
| talking about the issue that has been causing you so much grief
|
| https://en.wikipedia.org/wiki/Dead_cat_strategy
| johnkpaul wrote:
| I actually don't think that's the dead-cat-saying that the
| parent is referencing. I think that it's this concept
| http://itskeptic.org/dead-cat-syndrome.html
|
| I am also unfamiliar though and I'm reading up on it right
| now.
| 0xbadcafebee wrote:
| That's the one. Only old fogies (like me) know it I guess.
| It was the thing we all referred to as the impetus behind
| DevOps, when it became a thing a decade ago.
| dijksterhuis wrote:
| i'm not an old fogie and i kind of guessed "throwing a
| dead cat over the wall" was what you meant.
|
| although i was actually taught DevOps proper before it
| got super fashionable.
|
| maybe that makes me an old fogie nowadays? :/
| paulddraper wrote:
| > devs and sysadmins to talk about deploys together
|
| Not quite, it was the merger of development and operations
| responsibilities.
|
| I.e. devs who would sysadmin / sysadmins who would develop.
| zsoltkacsandi wrote:
| > a way for devs and sysadmins to talk about deploys together
|
| My take on this as someone who has 14 years of dev and 7 years
| of ops experience: DevOps is a flawed concept.
|
| The problem never was the lack of communication between the
| devs and sysadmins, it's just a symptom.
|
| The root cause is that the management puts pressure on the devs
| to innovate and deliver as fast as possible, and puts pressure
| on the ops to ensure that the system is stable, reliable,
| scalable, it has a 99.95% uptime and any issues will be solved
| by the on-call.
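To put the 99.95% figure above in concrete terms, here is a minimal sketch of the arithmetic that turns an availability target into a downtime budget. The function name and the 30-day and 365-day periods are assumptions chosen for illustration; they are not from the comment.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime allowed in a period for a given target.

    availability_pct is a percentage such as 99.95; period_days is the
    length of the measurement window in days (an illustrative choice).
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        budget = downtime_budget_minutes(99.95, days)
        print(f"{label}: about {budget:.1f} minutes of downtime allowed")
```

Under these assumptions the budget works out to roughly 21.6 minutes per 30-day month, which is the margin the ops side is asked to protect while the dev side is asked to ship quickly.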
|
| So these two groups have conflicting interests and when this
| leads to conflicts and arguments the conclusion is that they
| just don't want to collaborate/communicate.
|
| There are many departments at a company that have conflicting
| interests and can interfere with each other. If DevOps was a
| real thing there would be a need for LegalOps, DevSales,
| DevProduct, HROps, etc..
| fragmede wrote:
| Paralegals, legal clerks; sales engineers; product
| management, project management, TPMs; HR coordinator, HR
| administrative assistant...
|
| I'm not sure your examples say we don't need DevOps!
| zsoltkacsandi wrote:
| > Paralegals, legal clerks; sales engineers; product
| management, project management, TPMs; HR coordinator, HR
| administrative assistant
|
| The point wasn't that there are multiple job titles in
| different fields.
|
| > I'm not sure your examples say we don't need DevOps!
|
| The main assumption of DevOps is that the friction between
| developers and operations is the main cause of slow
| delivery.
|
| Well, this friction can happen between dev and legal, and
| dev and sales, and dev and product, and ops and product,
| and ops and sales, and ops and finance, etc.
|
| The companies that need DevOps have a much more deeply rooted
| problem: the lack of collaboration between any department
| (not just dev and ops). Therefore DevOps won't solve their
| problems. And companies that don't have this problem won't
| need DevOps, because they don't have problems with
| collaboration.
| solatic wrote:
| > DevOps is a flawed concept... conflicting interests... the
| conclusion is that they just don't want to
| collaborate/communicate.
|
| Wrong! This tendency is exactly what DevOps explains is the
| natural state of affairs without DevOps principles. The
| solution that DevOps advocates for is that such conflicting
| interests must not be expressed in meetings (where the
| culture conflict ensures they will get nowhere) but rather
| expressed in code. Infrastructure must be in code, deploys
| must be in code, testing must be in code, builds must be in
| code, policy must be in code, and the implicit pipeline with
| all the handoffs between teams connecting them all must also
| be in code. This makes everything fast (at least in
| comparison to manual processes), and makes everything
| explicit (in code) so that people can reach outside of their
| natural organizational silos to propose changes elsewhere in
| the pipeline, i.e. infra-focused engineers can add failing
| tests to prove the existence of a bug, developers can add
| infra that they need, QA can increase infra resources to
| ensure that sufficient resources are available for expected
| scale.
|
| The problem most organizations have is that they're not
| actually willing to force everyone's concerns to be written
| in code, and people are forbidden from reaching outside their
| silos. Usually this is due to poor hiring and training
| practice, e.g. "QA doesn't know how to write code" or "we
| can't let developers touch security policy". Sometimes it is
| due to leadership itself misunderstanding DevOps ("developers
| are forbidden from touching production").
|
| > If DevOps was a real thing there would be a need for
| LegalOps, DevSales, DevProduct, HROps, etc.
|
| There _is_ such a need, and it is generally being fulfilled
| by the systems in place. Small example - HROps would be
| ensuring that changes in the organization (i.e. people moving
| teams) accurately results in proper loss (of old) and gain
| (in new) privileges. This is done with integration between HR
| systems (the system of record as to who reports to whom and
| what their responsibilities are) and Active Directory or
| Google Workspace / Google Groups ensuring that people are
| automatically moved between the groups to which permissions
| are granted.
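The HR-to-directory integration described above can be sketched as a small reconciliation step. This is only an illustration under stated assumptions: `plan_group_changes`, the dictionary shapes, and the one-group-per-team mapping are hypothetical, and no real HR system, Active Directory, or Google Workspace API is called.

```python
from typing import Dict, List, Set, Tuple

Change = Tuple[str, str]  # (user, group)


def plan_group_changes(
    hr_teams: Dict[str, str],
    current_groups: Dict[str, Set[str]],
) -> Tuple[List[Change], List[Change]]:
    """Compare the HR system of record with current group membership.

    hr_teams maps each user to their team (hypothetical: one group per team).
    current_groups maps each group name to the users currently in it.
    Returns (to_add, to_remove), each a list of (user, group) pairs.
    """
    # Build the desired membership from the HR records.
    desired: Dict[str, Set[str]] = {}
    for user, team in hr_teams.items():
        desired.setdefault(team, set()).add(user)

    to_add: List[Change] = []
    to_remove: List[Change] = []
    for group in sorted(set(desired) | set(current_groups)):
        want = desired.get(group, set())
        have = current_groups.get(group, set())
        to_add.extend((user, group) for user in sorted(want - have))
        to_remove.extend((user, group) for user in sorted(have - want))
    return to_add, to_remove


if __name__ == "__main__":
    # alice has moved from "search" to "payments" according to HR.
    hr = {"alice": "payments", "bob": "search"}
    groups = {"search": {"alice", "bob"}, "payments": set()}
    adds, removes = plan_group_changes(hr, groups)
    print("grant:", adds)
    print("revoke:", removes)
```

Keeping the planning step a pure function means the grant and revoke lists can be reviewed or tested before anything is applied to a real directory.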
| zsoltkacsandi wrote:
| > Wrong! This tendency is exactly what DevOps explains is
| the natural state of affairs without DevOps principles. The
| solution that DevOps advocates for is that such conflicting
| interests must not be expressed in meetings (where the
| culture conflict ensures they will get nowhere) but rather
| expressed in code.
|
| Yes, there is the theory, principles, advocates, etc, etc.
| And there is the reality, and based on many-many years of
| experience as a dev, as an ops, and as a manager at companies
| from small-sized to enterprise, the reality isn't even
| close to this.
|
| > Infrastructure must be in code, deploys must be in code,
| testing must be in code, builds must be in code, policy
| must be in code, and the implicit pipeline with all the
| handoffs between teams connecting them all must also be in
| code.
|
| This is not DevOps, and you don't need DevOps for this.
| This is just about having an engineering mindset.
|
| > so that people can reach outside of their natural
| organizational silos to propose changes elsewhere in the
| pipeline
|
| Do you know how many times I have seen a developer touching
| Terraform code, ansible playbooks, or pipelines described
| as code? I am not saying that it never happened, but it was
| a rare occasion.
|
| > The problem most organizations have is that they're not
| actually willing to force everyone's concerns to be written
| in code
|
| I managed such an enforcement and change. It did not solve
| the cultural and collaboration issues.
|
| > HROps would be ensuring that changes in the organization
| (i.e. people moving teams) accurately results in proper
| loss
|
| This is just a matter of SoPs and workflows. It has nothing
| to do with the topic.
|
| No offense, but your comment is a perfect reflection of why
| DevOps is a flawed concept. You are talking about
| enforcement of everything described as code, advocates,
| principles, etc.
|
| If there is a good culture and collaboration the
| infrastructure as code, the advocates, etc will come
| naturally. People will find a way to collaborate. But not
| the other way around, these won't fix the culture.
| nijave wrote:
| While reaching across silos seems good, in theory, my
| experience is there's just too much breadth and domain
| knowledge for it to work consistently.
|
| Sure I know application code and have worked with a handful
| of frameworks, but if I'm enforcing infrastructure or
| performance concerns and implementing across a handful of
| different services, it's extremely time consuming getting
| up to speed in each repo and understanding the subtleties
| and patterns between each one.
|
| I can optimize queries and debug performance issues but the
| usual roadblock is understanding what the code is
| _supposed_ to do and whether an optimization provides the
| correct results (which is not always clear from tests,
| assuming good ones exist)
| qwertox wrote:
| SRE == Site Reliability Engineering.
|
| Quoting Wikipedia:
|
| Site Reliability Engineering (SRE) is a discipline in the field
| of Software Engineering that monitors and improves the
| availability and performance of deployed software systems, often
| large software services that are expected to deliver reliable
| response times across events such as new software deployments,
| hardware failures, and cybersecurity attacks[1]. There is
| typically a focus on automation and an Infrastructure as code
| methodology. SRE uses elements of software engineering, IT
| infrastructure, web development, and operations[2] to assist with
| reliability. It is similar to DevOps as they both aim to improve
| the reliability and availability of deployed software systems.
|
| https://en.wikipedia.org/wiki/Site_reliability_engineering
| tetris11 wrote:
| Thank you. Ridiculous that every other acronym was defined
| except the one in the title..
| doublerabbit wrote:
| Modern day SysAdmin.
|
| SysOp > SysAdmin > SRE
|
| No different to what I've been doing for the past 15 years. Web
| 2.0 needed a new buzzword is all.
|
| SREs are System Admins who come from the development
| background.
|
| System Admins come from System Operators background.
|
| Uptime is easy when people actually listen to what I have to
| say or listen to NetOps.
|
| Rather than DevOps throwing $next technology at everything or
| "needing" 100x more X because their codebase lacks.
| jmillikin wrote:
| SREs are programmers who specialize in writing programs that
| manage complex distributed systems.
|
| If you hire SREs and have them doing sysadmin work, then (1)
| you're massively over-paying and (2) they'll get bored and
| leave once they find a role that makes better use of their
| skills.
|
| If you hire sysadmins for SRE work, they'll get lost the
| first time they need to write a kernel module or design a
| multi-continent data replication strategy.
| doublerabbit wrote:
| I stand ish corrected. It feels much the same as the role
| of a Senior Sysadmin. I do both, and I wouldn't call myself
| an SRE.
| dijit wrote:
| > If you hire sysadmins for SRE work, they'll get lost the
| first time they need to write a kernel module or design a
| multi-continent data replication strategy.
|
| Ah yes, the old (incorrect) mantra of "sysadmins couldn't
| code". Which is ironic, as the vast majority of the
| abstractions that you'll interface with are written by
| sysadmins.
| qwertox wrote:
| IDK, writing things like kernel modules to improve the
| reliability of a complex system doesn't really sound like
| a task sysadmins get paid for.
|
| Yes, a lot of coding (mostly in scripting languages) is
| normal, mostly to automate tasks and improve visibility
| into the system, to make data digestible for tools like
| Grafana, but other optimizations seem to be out of
| bounds.
|
| But I most likely do lack the insights you have.
| dijit wrote:
| I've written kernel code to do various anti-ddos stuff,
| however it's the exception for sure.
|
| Debugging complex systems is _more_ in the wheelhouse of
| sysadmins. When I came up it was a requirement for
| sysadmins to be proficient in C, a commandline debugger
| (usually gdb), the unix/linux syscall interface
| (understanding everything that comes out of strace for
| example) and perl.
|
| Usually those perl scripts ended up becoming an
| orchestration/automation platform of some kind- ruby
| replaced perl at some point. I guess it's python and Go
| now?
|
| The modern "kernel module" requirement is more likely to
| be a kubernetes operator or terraform module, and the
| modern day sysadmin definitely writes those (the rest of
| the role is essentially identical, just tools got better)
| rzz3 wrote:
| These days I wonder if Google is really the example to follow.
| There was a time 10 or 15 years ago where Google seemed to be
| leading the industry in everything, and I feel like a lot of
| people still think they do when it comes to engineering culture.
| These days I tend to see Google as a bit of a red flag on a
| resume, and I have a set of questions I ask to make sure they
| didn't drink too much of the koolaid. Perhaps more importantly,
| when I look at Google from the outside these days, I see that
| their products have really gone downhill in terms of quality. I
| see Google Search riddled with spam, I see Gemini struggling to
| keep up with OpenAI, Google Chat trying to keep up with Slack but
| missing the mark, Nest being stagnant, I could go on and on. All
| this to say that I don't think Google is the North Star that it
| used to be in terms of guiding engineering culture throughout the
| industry.
| dehrmann wrote:
| I agree on the product and customer service front, but Google's
| reliability is top-notch.
| mirashii wrote:
| As a Google Cloud customer, I'd say it might be best to split
| Google into some divisions or something, as Google Cloud's
| reliability is a relative shitshow compared to Google.com.
| pphysch wrote:
| Who is then?
| scarface_74 wrote:
| From a product standpoint every BigTech company has done
| better at releasing new products than Google.
| pphysch wrote:
| How did we get from SRE culture to (paraphrasing) "I
| personally think Google makes worse products than IBM,
| Oracle, Apple, Netflix, Broadcom, et al."
| scarface_74 wrote:
| Having good technology and good products are orthogonal.
| People are conflating the two
| hollowsunsets wrote:
| What defines a good product? Something that many
| customers use? Something that makes shareholders happy?
| scarface_74 wrote:
| A product that either moves the needle as far as revenue
| and/or makes the ecosystem better. It also needs to be a
| product that gets continuously better as long as there is
| a market for it and not abandoned quickly.
|
| - "a connected TV device". How many cancelled lines of
| products have they abandoned? How many market failures
| have they had in their own line of phones? The Pixels
| aren't taking the world by storm and they spent billions
| on Motorola and then sold it off for scraps
|
| They have been releasing and cancelling their own tablet
| initiatives for years.
|
| At one point they had 5 separate messaging initiatives
| going on simultaneously.
|
| Even today they have three operating system initiatives
| that are not on the same codebase - Fuchsia, Android and
| ChromeOS.
|
| They have basically abandoned Flutter and don't use it
| for any of their high profile apps.
|
| What have they actually done besides ads?
|
| And the obvious evidence is their money losing "other
| bets"
|
| Also Google Fiber
|
| https://www.spglobal.com/marketintelligence/en/news-
| insights...
| jofla_net wrote:
| this came to mind
|
| https://www.spiceworks.com/tech/data-management/news/google-...
| sanj wrote:
| Fixed about a week later:
|
| https://support.google.com/drive/thread/245861992/drive-
| for-...
| znpy wrote:
| it shouldn't have happened in the first place.
| taeric wrote:
| I'm curious what you have in mind for evidence of "koolaid"
| there?
|
| Hard to disagree with the general trend you are outlining.
| Most of that feels driven by product choices, moreso than
| execution. I think a lot of the previous glorification of their
| work was likely misguided, as well. But I would be hard pressed
| to be quantitative on that.
| scarface_74 wrote:
| It was 5 years after Android was introduced that the CEO
| stopped using BlackBerry...
| taeric wrote:
| An amusingly good quantification of some evidence. Well
| done! :D
|
| Still, I don't have much to say that I think the
| engineering was overly good or bad. I typically think that
| what they captured for a short while, at least, was
| enthusiasm. In particular, developers were enthusiastic to
| be near Google technology in a way that I don't think I've
| seen for other companies, since.
|
| I don't think they identified it as such, though. Which
| could be why they seem slow to see that a lot of that has
| evaporated.
|
| Not to say that they have no enthusiasm, now. I'd wager
| they still have a lot. But as a percentage share of all
| developers, it feels very different.
| scarface_74 wrote:
| I would never hire a _product_ person from Google or someone I
| needed to be visionary. For the most part, their products suck,
| they have no vision and no follow through.
|
| But their _technology_ is top notch. I hire mostly for startups
| and green field initiatives though and I wouldn't hire anyone
| from any BigTech company unless I had "hard" technical problems
| to solve.
|
| Yes I've done a stint at BigTech.
| ninkendo wrote:
| They have top notch tech, yes, but it's massively overkill
| for literally every company that's not at google's scale. If
| you're not careful you may hire someone who will try to
| replicate everything google does, when you may need only 1%
| of the complexity. This is the experience I've generally had
| with xooglers... they lament that they don't have the same
| tools/tech stack they had at google, and so their first act
| is to try to move everything to the closest open source
| equivalents, even if they're not a good fit.
|
| There's good things and bad things to take away from
| experience at google... you have to be careful to ignore
| things that won't actually help you.
| scarface_74 wrote:
| I agree. I haven't run into a "hard problem" in my career
|
| By hard problem I mean technically in the top 5% of
| problems in the industry that can't be solved by throwing
| money at a SaaS or using a cloud provider.
| everfrustrated wrote:
| Most companies are really just crud apps. Very few are
| doing anything technically innovative. And that's just
| fine.
|
| I wonder how much of the early Google technical
| innovation was more a product of open source
| tech/distributed systems being a lot more immature (I'm
| particularly thinking databases) 25 years ago.
|
| Ultimately all companies get bloated and lose their way.
| It shouldn't be a surprise this has happened to Google -
| 25 years on they are a mega corp and idling. Probably for
| the best as it allows innovators a chance to compete.
| deepsun wrote:
| I've been the "you're not google" person for several years,
| but now softened my position.
|
| The thing is -- it depends. Sometimes when everyone knows
| some complex system well -- it becomes easy.
|
| One example comes to mind -- Kubernetes. 90% of teams don't
| need all its complexity. And I've been "you don't need it"
| person for some time. But now I see that when everyone
| knows it -- it's actually much easier to deploy even simple
| websites on it, because it's a common lingo and you don't
| spend time explaining how it's deployed.
|
| It's not like civil engineering, where an over-engineered
| bridge would cost a lot more in materials.
| scarface_74 wrote:
| If you have a simple website, you can containerize your
| backend and use much simpler services from AWS and serve
| your static assets on S3.
|
| Kubernetes is rarely the right answer for simple things
| even if Docker is.
| mschuster91 wrote:
| > much simpler services from AWS
|
| Like what, Lambda? I've seen so many horrible hacks and
| shit done with it (and other AWS services _cough_ API
| gateway _cough_ ), these days I rather prefer a set of
| Kubernetes descriptors and Dockerfiles.
|
| At least that combination all but _enforces_ people doing
| Infrastructure-as-code and there's (almost) no
| possibility at all for "had to do live hack XYZ in the
| console and forgot to document it or apply it back in
| Terraform".
| scarface_74 wrote:
| AWS App Runner
|
| https://aws.amazon.com/blogs/containers/introducing-aws-
| app-...
|
| Google has something similar.
| icedchai wrote:
| GCP has Cloud Run, which looks similar. App Runner is
| basically a wrapper on top of Fargate, right?
| scarface_74 wrote:
| Yep. Every "serverless" compute service is just a wrapper
| on top of Firecracker including Lambda and Fargate.
| icedchai wrote:
| In my experience, you are better off with ECS/Fargate
| than Lambda for serving an API. You get much more
| flexibility.
|
| Also, I've witnessed people editing Lambda code through
| the console instead of doing a real deploy. what a
| mess...
| scarface_74 wrote:
| You can't edit Lambda code in the console when you deploy
| a Docker image to Lambda.
|
| As far as flexibility, while there have been third party
| libraries that let you deploy your standard Node/Express,
| .Net/ASP, Python/Flask app to Lambda, now there is an
| official first party solution
|
| https://github.com/awslabs/aws-lambda-web-adapter
|
| And as far as ECS, it is stupid simple
|
| I've used this for years
|
| https://github.com/1Strategy/fargate-cloudformation-
| example/...
| mschuster91 wrote:
| > Also, I've witnessed people editing Lambda code through
| the console instead of doing a real deploy. what a
| mess...
|
| Yeah, exactly that's what I am talking about. Utter
| nightmare to recover from, especially if whoever set it
| up thought they needed to follow some uber-complex AWS
| blog post with >10 involved AWS services and didn't
| document any of it.
| scarface_74 wrote:
| You can't edit Lambda code directly in the console when
| using Docker deployments
| deepsun wrote:
| Your response is the perfect example of my point. Each
| time you use "much simpler services" you still _need to
| explain_ the setup for the simpler services. Someone
| might know it, someone not. E.g. some project may
| eventually grow out of Lambda RAM limitations, but no one
| in the team knew that. While Kubernetes is one-size-fits-
| all setup, even if I don't like it.
|
| And yes, I use the Cloud Run myself, but only for my one-
| person projects. For the team projects consistency is
| much more important (same way to access/monitor/version
| etc).
|
| PS: I would say even AWS/GCP is already a huge overkill
| for most projects. But for some reason you didn't see
| exactly the same problem starting with clouds right away.
| scarface_74 wrote:
| Lambda can use up to 10GB of RAM and there is also App
| Runner.
|
| And "using AWS" can be as simple as starting off with
| Lightsail - a VPS service with fixed pricing
|
| https://aws.amazon.com/lightsail/
| deepsun wrote:
| RAM is just one example. Every simpler service has its
| limitations, and if everyone (including new hires) knows
| the simpler service well -- it's perfect. E.g. in my
| experience everyone knew App Engine at some point and it
| worked well for us. Now it's a zoo of devops pieces, so I
| tolerate Kubernetes only because everyone kinda knows it.
|
| And the Kubernetes was just one example of my "you're not
| google" point. There are many more technologies that are
| definitely overkill, but are a good common denominator,
| even when it's 1000x more complex than needed for the
| task at hand.
|
| PS: Btw, I dunno why people downvoted your comment. It
| fits the HN "Guidelines" at the bottom, so upvoted.
| elktown wrote:
| The problem is that it can create a chain-reaction of
| complexity _because_ it opens up possibilities for over-
| engineering. In the sense of: "Yes, it's a bit over-
| engineered, but k8s makes it manageable for us anyway!" -
| consciously or subconsciously. When I'd often suspect
| that some restrictions in what's possible/acceptable
| would've created a significantly leaner overall design in
| the end.
| VirusNewbie wrote:
| > but it's massively overkill for literally every company
| that's not at google's scale.
|
| Before I worked at Google I was at a small telecom company
| that was running into limits of what some of the
| Dataflow/Apache Beam product could do, so we had to rewrite
| it (and commit it back to Beam).
|
| There _are_ companies that have massive scaling issues even
| if they're not planet scale cloud providers or something.
|
| You can replicate a lot of Google tech now by....just using
| the OSS they release and/or jumping on a modern cloud
| provider (GCP or AWS). It's not 2012. You can use a good
| database and not have to reinvent it.
| jarsin wrote:
| And now we are all stuck doing leetcode interviews primarily
| because of Google.
| marssaxman wrote:
| Leetcode didn't exist back then; the site was founded a
| little less than a decade ago.
| jarsin wrote:
| I was using "Leetcode" as in style of the interview. The
| Leetcode website was founded due to everyone in the
| industry copying google/big tech in this style of
| interviews.
| mike_hearn wrote:
| Google copied that idea from Microsoft, primarily.
| fweimer wrote:
| The flip side from this assumption about their technology is
| that if some service is not working, people are very quick to
| blame the impacted (paying) user. "You are running into rate
| limits", "Google is applying anti-abuse controls to your
| account", and so on. But at least for some services, I
| strongly suspect we are actually experiencing random system
| failures. In my experience, it's rare to get acknowledgement
| of this from Google. Tickets may not even make it to them
| because of this pervasive "Google technology is perfect"
| assumption. The end result feels a bit like gaslighting
| (doubting our sanity because we can't spot the pattern that
| is supposed to be obvious): we are encouraged to attribute
| meaning to more or less random reactions from a complex
| system.
| brudgers wrote:
| Unless you have Google sized problems and resources, Google
| probably is not the best example because the things Google does
| are done to address Google size problems with Google sized
| resources. Its tooling and methods are not commercial
| products.
|
| For example, Google can get away with the flaws of its AI
| search results because it is Google.
| jeffbee wrote:
| The fact that some people prefer ChatGPT over Gemini is not
| something that SRE can help you with. The fact that ChatGPT is
| rarely available is something that SRE could help Microsoft
| avoid.
| lupire wrote:
| ChatGPT is rarely available??
| jeffbee wrote:
| They have major, long-lasting incidents at least once a
| week. https://status.openai.com/
| yodsanklai wrote:
| > There was a time 10 or 15 years ago where Google seemed to be
| leading the industry in everything
|
| They used to write interesting books and articles about
| software engineering. It felt that they were maintaining high
| quality standards and were an industry reference. Nowadays, I
| wouldn't go as far as saying it's a red flag to have Google on
| one's resume, but definitely not the same appeal as before.
| nimish wrote:
| Goodhart's Law for employment
| jph wrote:
| The article describes Causal Analysis based on Systems Theory
| (CAST) which is akin to many-factor root cause analysis.
|
| I am a big fan of CAST for software teams, and of MIT Prof. Nancy
| Leveson who leads CAST.
|
| My CAST summary notes for tech teams:
|
| https://github.com/joelparkerhenderson/causal-analysis-based...
|
| MIT CAST Handbook:
|
| http://sunnyday.mit.edu/CAST-Handbook.pdf
| pulkitsh1234 wrote:
| Are there any resources to show how to apply this in practice?
| This is too theoretical to grok for me, there are too many
| terms. It seems too time-consuming to understand (and to
| perform IMO)
| jph wrote:
| > This is too theoretical to grok for me
|
| Here's a fast, easy, practical way to think about CAST:
|
| 1. Causal: Novices may believe accidents are due to one "root
| cause" or a few "probable causes", but it turns out that
| accidents are actually due to many interacting causes.
|
| 2. Analysis: Novices may blame people, but it's smarter to do
| blame-free examination of why the loss occurred, and how it
| occurred i.e. "ask why and how, not who".
|
| 3. Systems: Novices may fix just one thing that broke, but it
| turns out it's better to discover multiple causes, then plan
| multiple ways to improve the whole system.
| m0nkee wrote:
| like your points
| materielle wrote:
| I was listening to a Titus Winters podcast, and I'm not sure he
| exactly put it like this, but I took it away as:
|
| There are two problems with automated testing. 1) tests take
| too long to run 2) difficult to root cause breakages.
|
| Most devs solve this with making unit tests ever more granular
| with heavy use of mocks/fakes. This "solves" both problems in a
| narrow sense: the tests run faster and are obvious to root
| cause breakages.
|
| But you didn't actually solve the problem. Since the entire
| point of writing tests in the first place was to answer the
| question: "does my system work"? Granular and mocked unit tests
| don't help much.
|
| However, going back to the original question, we can actually
| reframe the problems as: 1) a work scheduling problem and 2) a
| signal processing problem.
|
| Those are pretty well understood problems with good solutions.
| It's just that this is a somewhat novel way of thinking of
| tests, so it hasn't really been integrated into the open source
| tool chain.
|
| You could imagine integration tests automatically being correlated
| to a micro service release. Some CI automation constantly
| running expensive tests over a range of commits and
| automatically bisecting on failure. Etc.
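|
| A minimal sketch of the "bisect on failure" part, assuming the
| test is deterministic and that the first commit in the range is
| known good and the last known bad. The names first_bad_commit and
| run_expensive_test are made up for illustration; a real setup
| would check out each commit and run the actual suite:

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], passes: Callable[[str], bool]) -> str:
    """Binary-search for the earliest commit at which the test fails.

    Assumes commits[0] passes, commits[-1] fails, and the test is not flaky:
    once it starts failing it keeps failing for all later commits.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] passes, commits[hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Hypothetical history in which the (pretend) expensive test breaks at "c4".
history = ["c0", "c1", "c2", "c3", "c4", "c5"]
runs = []  # records which commits actually got tested


def run_expensive_test(commit: str) -> bool:
    runs.append(commit)
    return history.index(commit) < history.index("c4")


culprit = first_bad_commit(history, run_expensive_test)
print(culprit, runs)
```

| Only three of the six commits get tested here, which is the
| scheduling win: the expensive test runs O(log n) times per
| breakage instead of once per commit.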
|
| Put another way, automated tests don't go far enough. We need
| yet another higher layer of abstraction. Computers are better
| at deciding what tests to run and when, and are also better at
| interpreting the results.
| azurelake wrote:
| > Put another way, automated tests don't go far enough. We
| need yet another higher layer of abstraction. Computers are
| better at deciding what tests to run and when, and are also
| better at interpreting the results.
|
| Sounds like you might be interested in
| https://antithesis.com/ (no affiliation).
| typesanitizer wrote:
| Thanks for writing the summary notes and sharing those here.
| After reading the Usenix article, I was thinking that we could
| apply some of the ideas at $WORK, but the exact "How" was still
| not super clear. Your notes offer a compact and accessible
| starting point without having to ask colleagues to dive in to a
| 100+ page PDF. :D
| MPSimmons wrote:
| This reminds me very much of Sidney Dekker's work, particularly
| The Field Guide to Understanding Human Failure, and Drift Into
| Failure.
|
| The former focuses on evaluating the system as a whole, and
| identifying the state of mind of the participants of the
| accidents and evaluating what led them to believe that they were
| making the correct decisions, with the understanding that nobody
| wants to crash a plane.
|
| The latter book talks more about how multiple seemingly
| independent changes to complex loosely coupled systems can
| introduce gaps in safety coverage that aren't immediately
| obvious, and how those things could be avoided.
|
| I think the CAST approach looks appealing. It seems as though it
| does require a lot of analysis of failures and near-misses to be
| best utilized, and the hardest part of implementing it will
| undoubtedly be the people, who often take the "there wasn't a
| failure, why should we spend time and energy investigating a
| success" mindset.
| jph wrote:
| Yes you're 100% right. Dekker is a valuable complement to CAST
| & STAMP because Dekker emphasizes people aspects of psychology,
| goals, beliefs, etc., while CAST emphasizes engineering aspects
| of processes, practices, metrics, etc.
|
| CAST describes how to pragmatically bring together the people
| aspects and the engineering aspects, by having stakeholders
| write a short explicit safety philosophy:
|
| https://github.com/joelparkerhenderson/safety-philosophy
| FuriouslyAdrift wrote:
| I think the single biggest thing about Google SREs (at least in
| the early years) was that if your team was going to launch a new
| product, you had to have an SRE to help and to maintain the
| service.
|
| Google deliberately limited the number of SREs, so you had to
| prove your stuff worked and sell it to the SRE to even get a
| chance to launch.
|
| Constraints help to make good ideas better...
| hollowsunsets wrote:
| It's not good when you have an SRE on hand to act as a
| babysitter of sorts. That is how some companies use SREs these
| days. They do the toil and sysadmin work so the product
| engineers can focus on features. Exactly what we hoped to
| avoid, but here we are.
| sgarland wrote:
| If by some you mean nearly all, then yes, and yes, it's
| terrible.
|
| Super fun being the adult in the room having to explain for
| the millionth time why someone can't expect that a network
| call will always succeed, and will never experience latency.
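|
| For what it's worth, a rough sketch of what "don't assume the
| call succeeds" can look like in code: bounded retries with
| exponential backoff and jitter. call_with_retries and flaky_fetch
| are hypothetical names, and the retried exception types are an
| assumption; pick whatever your client library actually raises:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient network errors a bounded number of times."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Exponential backoff with jitter so retries do not synchronize.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise AssertionError("unreachable")


# Simulated dependency that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


# Sleeping is stubbed out so the example runs instantly.
result = call_with_retries(flaky_fetch, sleep=lambda seconds: None)
print(result, calls["n"])
```

| Retries only help with transient failures and only if the call is
| safe to repeat; the caller still needs a plan for when the last
| attempt fails.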
| nvarsj wrote:
| Even at Google it's like this. I spent the holidays watching
| my on-call Google SRE friend trying to diagnose misbehaving
| mapreduce jobs devs had written. They are basically glorified
| first line support so SWEs don't get woken up in the night.
|
| Which seems like the worst possible setup to me - devs should
| be first on call for code they write. That seems like a basic
| principle to me and creates the correct incentives.
| arthurjj wrote:
| Thanks for this detail, I worked at Google, with SREs, and
| didn't know it. It seems like the type of 'design' detail that
| might be more important than this entire article
| emtel wrote:
| This culture was, imo, directly responsible for google's
| failure to launch a facebook competitor early enough for it to
| matter.
|
| The Orkut project was basically banned from being launched or
| marketed as an official google product because it was deemed
| "not production ready" by SRE. Despite that it gained huge
| market share in Brazil and a few other countries before
| eventually losing to FB. By the time their "production ready"
| product (G+) launched it was hilariously late.
|
| Facebook probably would have won anyway, but who knows what
| might have happened if Google had actually leaned into this
| very successful project instead of treating it like an unwanted
| step-child.
| mike_hearn wrote:
| How was it banned from being launched? It did launch and the
| desire to not be promoted as a Google product came from Orkut
| himself, iirc.
|
| The reason it was not regarded as 'production ready' was that
| the architecture didn't scale. In fact it also didn't run on
| the regular Google infrastructure that everything else used
| and that SRE teams were familiar with; it was a .NET app that
| used MS SQL Server.
|
| This design wasn't a big surprise. Facebook won not because
| Orkut lost but because Facebook were the first to manage
| gradual scaleup without killing their own adoption, by
| figuring out naturally isolated social networks they could
| restrict signup to (American universities). This made their
| sharding problem much easier. Every other competitor tried to
| let the whole world sign up simultaneously whilst also
| offering sophisticated data model features like the ability
| to make arbitrary friendships and groups, which was hard to
| implement with the RDBMS tech of the time.
|
| Orkut did indeed suffer drastic scaling problems and
| eventually had to be rewritten on top of the regular Google
| infrastructure, but that just replaced one set of problems
| with another.
| ghaff wrote:
| Of course restricting rollout to American university emails
| (including alumni addresses--at least at one point) was
| also a pretty natural consequence of Facebook's origins.
| emtel wrote:
| The attitude within SRE toward Orkut (the product) was one
| of disdain if not contempt. A healthy culture does not
| treat rapidly growing products this way.
| mike_hearn wrote:
| I mean, I'm personal friends with a former Orkut SRE. The
| idea that Google SRE ignored or disdained Orkut just
| isn't right. Nonetheless, if your job is defined as "make
| the site reliable" and it's written in a way that can
| never be reliable then it's understandable that you're
| going to have at least some frustrations.
| philll wrote:
| That sounds dysfunctional.
| crabbone wrote:
| I wish this article was at most a quarter of its current length.
| Preferably even shorter. There's so much self-congratulatory and
| empty talk, it's really hard to get to the main point.
|
| I think the most important (and actually valuable) part is the
| mention of the work done by someone else (STPA and CAST). That's
| all there is to the article. Read about Causal Analysis based on
| Systems Theory (CAST) and System-Theoretic Process Analysis
| (STPA), and do what the book says.
| anal_reactor wrote:
| Agreed that the whole article could've been much shorter.
| Anyway, for me the key takeaway is not to trust your inputs.
| It's true that code correctness often boils down to "given
| input X, the program will correctly give output Y", but the
| actual issue is that sometimes the input X itself might be
| wrong. I think it's clearly visible in project management,
| where people tell you one thing, you plan accordingly, then
| later they do another thing, and if you haven't predicted this,
| you're done. If this behavior is so common in human projects in
| general, I see no reason why it wouldn't emerge in software
| projects too.
|
| The problem is, software that tries to do something smart with
| inputs is much harder to reason about, which in turn increases
| your likelihood of failure, which is exactly the thing you
| wanted to avoid in the first place. For example, imagine you
| have an edge case in your script where you want to perform "rm
| -rf /" but the safety mechanism prevents you from doing this,
| which effectively makes your script fail.
|
| In conclusion, in my humble opinion, the most important part of
| safety is choosing tools that are simplest to reason about. If
| you have a bash script you're guaranteed to have some bug
| related to some edge case - people managing POSIX realized that
| bash is so fundamentally broken that it's better to forbid
| certain filenames rather than fix bash. Use a Python library
| for 10x the safety but half the comfort. If you have a C++
| program it will leak memory no matter how hard you try. And so
| on.
|
| Similarly, when writing programs, you should give simple and
| strong promises about its API. Don't ever do "program accepts
| most sensible date strings and tries to parse that", do "it's
| either this specific format or an error".
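|
| A small sketch of the "this specific format or an error" idea,
| assuming the one accepted format is YYYY-MM-DD. The function name
| parse_date_strict is made up for the example:

```python
import re
from datetime import date

# Exactly four, two, and two ASCII digits separated by hyphens.
_ISO_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def parse_date_strict(text: str) -> date:
    """Parse YYYY-MM-DD and nothing else; raise ValueError otherwise."""
    match = _ISO_DATE.fullmatch(text)
    if match is None:
        raise ValueError(f"expected YYYY-MM-DD, got {text!r}")
    year, month, day = (int(group) for group in match.groups())
    # date() itself rejects impossible dates such as 2025-02-30.
    return date(year, month, day)


print(parse_date_strict("2025-01-03"))

for bad in ["2025-1-3", "03/01/2025", "2025-02-30", "tomorrow"]:
    try:
        parse_date_strict(bad)
    except ValueError as err:
        print("rejected:", err)
```

| The caller gets one of two outcomes, a date or a ValueError, and
| never a "best guess" that silently swaps day and month.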
|
| Verifying inputs and being smart about them is a good idea that
| should be used carefully because it can backfire spectacularly.
| herodoturtle wrote:
| Came here to say this.
|
| It's not the first article / publication on Google SRE I've
| read, and they're all similarly (and imho unnecessarily)
| verbose.
|
| Whilst I'm deeply grateful to the good folks at Google for
| sharing their hard-earned knowledge with us, I do wish their
| publications on this important topic were far more succinct.
| cudgy wrote:
| Article is about an acronym and yet never states what the acronym
| SRE means.
| packetslave wrote:
| Not everything needs to be spelled out for people who are too
| lazy to Google
| n0n0n4t0r wrote:
| It does... if you read the authors' bio in the footer :)
| abotsis wrote:
| Couple thoughts here: 1. The "rightsizer" example mentioned might
| well have had the same outcome if the outage was analyzed in a
| "traditional" sense. That said, it is much easier and more
| actionable with this new approach. 2. I've always hated software
| testing because faults can occur external to the software being
| tested. It's difficult to reason about those if you have a myopic
| view of just your component of the system. This line of thinking
| somewhat fixes that, or at least paves a path to fixing that.
|
| Unfortunately, while this article says a lot, much of it just
| repeats itself, and I wish there was more detail. For example: who all
| is involved in this process? Are there limits on what can be
| controlled? How (politically) does this all shake out with
| respect to the relationships between SREs and software engineers?
| Etc..
| wilson090 wrote:
| Agreed, the devil is in the detail for SRE functions, and the
| organizational details of how to leverage this framework are
| largely absent from this writeup. With so many teams struggling
| to get the organizational components right just for traditional
| SRE (due to budget constraints, internal politics,
| misunderstanding of SRE by leadership, etc), I'd imagine
| implementing the changes needed to leverage the ideas in this
| writeup will be impossible for all but extremely deep-pocketed
| tech companies.
|
| Nonetheless, lots of interesting concepts, so I would like to
| see a Google SRE handbook style writeup with more info that
| might be of more practical value.
| cmckn wrote:
| SWEs: are SRE/devops folks part of your day to day?
|
| I have never been in a SWE role where I didn't do my own "ops",
| even at FAANG (I haven't worked at Google). I know "SRE/devops"
| was/is buzzy in the industry, but it's always seemed, in the vast
| majority of cases, to be a half-assed rebrand of the old school
| "operations" role -- hardly a revolution. In general, I think my
| teams have benefited from doing our own ops. The software
| certainly has. Am I living in a bubble?
| fragmede wrote:
| I'm assuming in the ops side of your role, you're not filling
| in firewall rules paperwork to a network team, spinning up new
| servers to SSH in and SCP some files over and edit a couple of
| config files though. Operations just doesn't look like that
| anymore, so the fact that SWE teams can now do a meaningful
| amount of operations for their product _is_ the revolution. It
| may not feel like it if you weren't doing operations the old
| way, but there are a lot of tools that are invisible to make
| things work.
| baalimago wrote:
| SRE and DevOps are better summarized as 'cloud engineering',
| IMO. Basically, it's to set up and maintain the infrastructure
| which allows you to do your own ops as a dev.
| cmckn wrote:
| That's my impression as well. My SWE team has always done all
| of that ourselves, I've never felt the need for a dedicated
| role to maintain IaC and click around in the console.
| n0n0n4t0r wrote:
| I wonder at what scale this very interesting approach starts
| yielding more value than cost. What I mean is: is it FAANG-only,
| like so many things they seeded, or is it actually relevant at a
| non-FAANG scale?
|
| I tend to invest a lot in risk avoidance, so this is appealing
| to me, but I know that my risk avoidance tendency is rarely
| shared by my peers/stakeholders.
| mgaunard wrote:
| The example here seems to have to do with appropriately sizing the
| requirements of applications, which enables you to schedule
| more applications per machine, driving down costs.
|
| This is useful for any company larger than say 10 people.
|
| In general this is difficult to do, because there is more at
| play than memory, CPU and disk usage, especially if you have
| certain performance requirements.
|
| I find what's in Kubernetes (a Google product) pretty much
| useless, but maybe it works for web tech.
| n0n0n4t0r wrote:
| I understood their example more like: automating the scaling
| of servers is easy, having proper inputs for this scaling to
| be reliable is hard.
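|
| To make that concrete, here is a toy sketch (not how Google's
| rightsizer works; every name and threshold is an assumption).
| The scaling arithmetic is one line; most of the code is deciding
| whether the usage signal can be trusted at all:

```python
from typing import Optional, Sequence


def propose_quota(
    current_quota: float,
    usage_samples: Sequence[float],
    headroom: float = 1.2,
    max_step: float = 0.5,
) -> Optional[float]:
    """Return a new quota, or None when the input looks untrustworthy."""
    if not usage_samples:
        return None  # no data: do nothing rather than guess
    if any(sample < 0 for sample in usage_samples):
        return None  # negative usage means the metric is broken
    peak = max(usage_samples)
    if peak == 0:
        # Assumption: all-zero usage is more likely a dead metrics pipeline
        # than a genuinely idle service, so refuse to shrink on it.
        return None
    target = peak * headroom  # the "easy" part
    # Bound how far a single adjustment may move the quota.
    lower = current_quota * (1 - max_step)
    upper = current_quota * (1 + max_step)
    return min(max(target, lower), upper)


print(propose_quota(100.0, [40.0, 50.0, 60.0]))
print(propose_quota(100.0, [0.0, 0.0, 0.0]))
print(propose_quota(100.0, [1.0]))
```

| Returning None means "leave the quota alone and flag it for a
| human", which is a design choice for this sketch rather than
| something the article prescribes.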
|
| What they propose is to spend weeks of engineering time to
| perform analysis in the hope of finding some relevant issues.
| Are both this engineering time and the issue-fixing time
| relevant for non-FAANG companies?
|
| In other words: the leverage from not having issues is
| smaller, so the profitability of such analysis decreases. Where
| does the profitability become negative?
| mgaunard wrote:
| In practice it's pretty much impossible to get precise
| requirements without automatically learning them from how
| the application performed in the past.
|
| The problem is that it is high-risk to automatically
| perform those changes since they might affect the
| application in ways you do not expect.
| n0n0n4t0r wrote:
| I really don't think they are talking about requirements,
| at least not specifically. Aren't you focusing on your
| own level issues?
| mgaunard wrote:
| the example they gave is the quota rightsizer. Its job is
| to infer the right quota (requirement allocation).
| n0n0n4t0r wrote:
| Yes, but I mean that they shifted the focus to the input
| measurements over the correct quota value.
| mike_hearn wrote:
| It's definitely something that requires a high budget and a
| dedicated reliability team. In most orgs that have got as far
| as a proper post-mortem and analysis culture, they aren't even
| reliably draining the action items generated by the post
| mortems, so attempting to pre-emptively generate action items
| is kind of a moot point.
| lamontcg wrote:
| > Looking at a data flow diagram with more than 100 nodes is
| overwhelming--where do you even begin to search for flaws?
|
| Yeah, so maybe try not to build anything that complex to start
| with.
| __turbobrew__ wrote:
| Yea that was my take away too. Maybe limit the depth of any RPC
| call?
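|
| Rough sketch of what I mean, with the call graph simulated as a
| dict. MAX_DEPTH, DepthExceeded and handle are hypothetical names;
| a real RPC framework would carry the depth counter in request
| metadata instead of a function argument:

```python
from typing import Dict, List

MAX_DEPTH = 3  # hypothetical budget: at most three hops below the entry point


class DepthExceeded(RuntimeError):
    """Raised when a request has passed through too many services."""


def handle(service: str, deps: Dict[str, List[str]], depth: int = 0) -> List[str]:
    """Simulate `service` handling a request; return services touched, in call order."""
    if depth > MAX_DEPTH:
        raise DepthExceeded(f"{service} reached at depth {depth} > {MAX_DEPTH}")
    touched = [service]
    for downstream in deps.get(service, []):
        # Each hop increments the depth that travels with the request.
        touched += handle(downstream, deps, depth + 1)
    return touched


shallow = {"frontend": ["api"], "api": ["auth", "db"]}
deep = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}

print(handle("frontend", shallow))
try:
    handle("a", deep)
except DepthExceeded as err:
    print("rejected:", err)
```

| The shallow graph goes through; the five-service chain gets
| rejected at the fifth hop instead of silently growing.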
| lamontcg wrote:
| Also, just don't try to be Google and don't microservice the
| crap out of everything...
|
| The problem is that too many people in our industry are
| trying to get experience to land a job at Google so they try
| to turn every job into Google...
|
| Although honestly I suspect that Google could do more to
| simplify internally, but that is the kind of work that
| doesn't get you promoted, while layers of additional smart-
| sounding complexity do.
|
| And it really sounds to me like we've gone wrong as an
| industry, where you can't bolt together lego blocks and get
| working larger systems out of them, and have to worry about
| large scale spooky-action-at-a-distance effects. It is like
| having to worry about the interaction of your radio with your
| car's drive train. A simple fuse keeps the radio from killing
| the engine, and then the designer of the engine never has to
| think about the radio.
| ashepp wrote:
| I've been reading about CAST (Causal Analysis based on Systems
| Theory) and noticed some interesting parallels with mechanistic
| interpretability work. Rather than searching for root causes,
| CAST provides frameworks for analyzing how system components
| interact and why they "believe" their decisions are correct -
| which seems relevant to understanding neural networks.
|
| I'm curious if anyone has tried applying formal safety
| engineering frameworks to neural net analysis. The methods for
| tracing complex causal chains and system-level behaviors in CAST
| seem potentially useful, but I'd love to hear from people who
| understand both fields better than I do. Is this a meaningful
| connection or am I pattern-matching too aggressively?
| triclops200 wrote:
| I do AI/ML research for a living (my degrees were in
| theoretical CS and AI/ML and my [unfinished] PhD work was in
| computational creativity [essentially AGI]). I also do SRE work
| for a living.
|
| and yeah that's a useful way of characterizing some of the
| behaviors of some kinds of neural networks. There's a point at
| which the distinction between belief and "frequency (or
| probability-amplitude) state filter" becomes less apparent,
| though, that's more of a function-of-medium vs function-of-
| system distinction.
|
| However, systems like these can often become mediums,
| themselves, for more complex systems. Additionally, a system
| which has "closed-the-loop" by understanding the medium and the
| system as coupled as "self" and separate from the environment
| along with a direction/goal is a pretty decent, if imprecise,
| definition of a strange loop. Contradiction resolution between
| internal component beliefs gives a possible (imo, highly
| probable) mechanistic explanation for the phenomenon of free
| energy minimization in such systems. External contradiction
| resolution extends it to active inference.
| georgewfraser wrote:
| Like so many things from Google engineering this will be toxic to
| your startup. SREs read stuff like this, they get main character
| syndrome and start redoing the technical designs of all the other
| teams, and not in a good way.
|
| This phenomenon can occur in all "overlay" functions, for example
| the legal department will try to run the entire company if you
| don't have a good leader who keeps the team in their lane.
| physhster wrote:
| In my experience, SREs are usually "enforcers of
| maintainability". If your engineers don't want to be oncall,
| they need to produce applications and services that are
| documented and maintainable. It's an amazing forcing function.
| SRE doesn't often redo technical designs, there's plenty enough
| reliability and scalability work to do...
| jshen wrote:
| Your engineers should be on call.
| physhster wrote:
| At a 200-person company, sure. But when you're in the tens
| or hundreds of thousands, that's a hard no. Especially when
| dealing with out-of-scope dependencies.
| otterley wrote:
| I work for a company with millions of employees. Our SDEs
| and their managers carry and are responsible for
| answering the pagers. We don't have SREs.
| la64710 wrote:
| From the 90s the whole DNS on which the internet is standing
| today was run successfully with minimum error by a bunch of
| folks who used to call themselves sysadmins. Developers seem
| to run out of things to develop and they have been reinventing
| themselves as devops and SREs. They have been pushing out pure
| sysadmins but at the same time this trend shows how demand for
| developers or SWEs falls far short of the supply of developers
| in the market.
| tsss wrote:
| Take one look at the Kubernetes source code and it becomes
| clear that you can make successful software with zero clue
| about good software engineering.
___________________________________________________________________
(page generated 2025-01-04 23:01 UTC)