hngopher.com

       [HN Gopher] The Evolution of SRE at Google
       ___________________________________________________________________
        
       The Evolution of SRE at Google
        
       Author : r4um
       Score  : 277 points
       Date   : 2025-01-03 11:38 UTC (1 day ago)
        
 (HTM) web link (www.usenix.org)
 (TXT) w3m dump (www.usenix.org)
        
       | 0xbadcafebee wrote:
       | They're doing that thing that happened to DevOps. It started out
       | as a guy who wanted a way for devs and sysadmins to talk about
       | deploys together, so they didn't get dead-cat syndrome. It ended
       | up as an entire branch of business management theory,
       | consultants, and a whole lot of ignorant people who think it just
       | means "a dev who does sysadmin tasks".
       | 
       | Abuse of a single word to mean too many things makes it
       | meaningless. SRE now has that distinction. You've got SREs who
       | (mostly) write software, SREs who (mostly) stare at graphs and
       | deal with random system failures, and now SREs who use a
       | framework to develop a complex risk model of multiple systems
       | (which is more quality control than engineering).
        
         | inquist wrote:
         | I think failure mode analysis is definitely part of engineering
        
         | ImPostingOnHN wrote:
         | Without resorting to any "big-D Devops" definition, I have
         | almost always seen devops referring to "supporting / running
         | the code you write", and have never encountered the definition
         | where dev and ops were 2 different roles. That was what things
         | were like before devops, and coordination on product support
         | and planning wasn't great, hence devops.
        
           | moandcompany wrote:
           | A lot of organizations simply renamed the functional area of
           | "systems administration" or "systems engineering" to
           | "DevOps," and at many of these places, "DevOps" is the new
           | name for the group that software developers will throw stuff
           | over the fence to.
           | 
           | The issue with the above names is that they can be applied to
           | a domain or area of practices, or an organizational boundary.
           | In a non-trivial number of organizations, "DevOps" is viewed
           | as a support entity for one or more software development
           | teams, versus software development teams practicing "devops."
           | 
           | This applies to many of the *_Ops names in fashion during the
           | past five years or so.
        
             | znpy wrote:
             | After almost 10 years in the systems engineering /
             | administration / devops / cloud etc space all I can say is:
             | 
             | The biggest improvement that devops brought is that it made
             | managers feel dumb, outdated and scared because they were
             | not "doing devops" while everybody else was, so they kinda
             | started listening to sysadmins and what they had to say.
             | 
             | Uh, devops engineers did not come out of nowhere. They did
             | not come out of the ground like mushrooms. Most if not all
             | the "devops engineers" I know are just former sysadmins.
             | They were already willing to do whatever devops was
             | supposed to be, it's just that they were largely ignored.
             | 
             | Writing this I just realized that maybe the best way to
             | obtain organizational change is to make management and
             | upper management feel stupid and outdated. Interesting.
        
               | steveBK123 wrote:
               | Hence they have all hired a "head of AI" in the last 18
               | months.
        
               | moandcompany wrote:
               | "What is your AI strategy?" -> "We're hiring a Head of
               | AI"
        
               | steveBK123 wrote:
               | It's the 2020s version of the old IBM line... No one ever
               | got fired for hiring a head of AI.
        
           | stackskipton wrote:
           | I'm an Ops type person, so I work at companies where there is
           | a split between the two. Ops is a skill not all developers
           | have, or frankly even the mindset to do properly, so you will
           | need a team/person to do it. Generally companies don't like
           | the cost of embedding an Ops person into every team, and that
           | can create redundant work, so they form a DevOps/SRE team.
           | 
           | Good resource for different types of teams is here:
           | https://web.devopstopologies.com/
        
           | 0xbadcafebee wrote:
           | The reason you never encountered the second definition is
           | two-fold:
           | 
           | 1) There is no formal academic education behind the concept
           | (that I'm aware of). If you do a CS major, nobody's going to
           | explain to you the accumulated 15 years of practice and
           | knowledge around the concept.
           | 
           | 2) Due to 1), people just repeat what other people tell them.
           | It's like a long game of telephone. It turns out most
           | software development today is just a game of telephone
           | between devs (and now AI). So almost everyone is misinformed.
           | 
           | The Wikipedia page for DevOps is the best generic starting
           | point if you want to know more.
           | 
           | If you want to know more after that, there are a number of
           | books and blog posts. Jez Humble, John Willis, Gene Kim,
           | Patrick Debois, etc are the people to read. It's a much
           | larger body of knowledge than you might think. Almost none of
           | it has to do with devs supporting/running what they write
           | (that's a small subset of a larger category, and there's
           | multiple categories of 'stuff')
        
         | dilyevsky wrote:
         | > You've got SREs who (mostly) write software, SREs who
         | (mostly) stare at graphs and deal with random system failures,
         | and now SREs who use a framework to develop a complex risk
         | model of multiple systems (which is more quality control than
         | engineering).
         | 
         | This was always the case, or at least it goes back 15 years or
         | more, as highlighted by the so-called "Treynor curve".
        
           | pmb wrote:
           | NB: The Treynor Curve is named after Ben Treynor and his
           | ideas. Ben Treynor's name changed to Ben Sloss a few years
           | back, and Ben Sloss is one of the authors of this article.
        
         | 01HNNWZ0MV43FF wrote:
         | Never heard of dead-cat syndrome. In case anyone else wonders:
         | 
         | > There is one thing that is absolutely certain about throwing
         | a dead cat on the dining room table - and I don't mean that
         | people will be outraged, alarmed, disgusted. That is true, but
         | irrelevant. The key point, says my Australian friend, is that
         | everyone will shout, "Jeez, mate, there's a dead cat on the
         | table!" In other words, they will be talking about the dead cat
         | - the thing you want them to talk about - and they will not be
         | talking about the issue that has been causing you so much grief
         | 
         | https://en.wikipedia.org/wiki/Dead_cat_strategy
        
           | johnkpaul wrote:
           | I actually don't think that's the dead-cat-saying that the
           | parent is referencing. I think that it's this concept
           | http://itskeptic.org/dead-cat-syndrome.html
           | 
           | I am also unfamiliar though and I'm reading up on it right
           | now.
        
             | 0xbadcafebee wrote:
             | That's the one. Only old fogies (like me) know it I guess.
             | It was the thing we all referred to as the impetus behind
             | DevOps, when it became a thing a decade ago.
        
               | dijksterhuis wrote:
               | i'm not an old fogie and i kind of guessed "throwing a
               | dead cat over the wall" was what you meant.
               | 
               | although i was actually taught DevOps proper before it
               | got super fashionable.
               | 
               | maybe that makes me an old fogie nowadays? :/
        
         | paulddraper wrote:
         | > devs and sysadmins to talk about deploys together
         | 
         | Not quite, it was the merger of development and operations
         | responsibilities.
         | 
         | I.e. devs who would sysadmin / sysadmins who would develop.
        
         | zsoltkacsandi wrote:
         | > a way for devs and sysadmins to talk about deploys together
         | 
         | My take on this as someone who has 14 years of dev and 7 years
         | of ops experience: DevOps is a flawed concept.
         | 
         | The problem never was the lack of communication between the
         | devs and sysadmins, it's just a symptom.
         | 
         | The root cause is that the management puts pressure on the devs
         | to innovate and deliver as fast as possible, and puts pressure
         | on the ops to ensure that the system is stable, reliable,
         | scalable, it has a 99.95% uptime and any issues will be solved
         | by the on-call.
         | 
         | So these two groups have conflicting interests and when this
         | leads to conflicts and arguments the conclusion is that they
         | just don't want to collaborate/communicate.
         | 
         | There are many departments at a company that have conflicting
         | interests and can interfere with each other. If DevOps was a
         | real thing there would be a need for LegalOps, DevSales,
         | DevProduct, HROps, etc..
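
For a sense of scale on the 99.95% uptime figure mentioned in the comment above, here is a minimal arithmetic sketch. The function name and the 30-day / 365-day periods are assumptions chosen for illustration; they do not come from the thread or from any particular SLA.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime allowed by an availability target.

    availability_pct is a percentage such as 99.95; period_days is the
    length of the measurement window in days (an assumed parameter).
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    # The unavailable fraction of the window is 1 - availability.
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        minutes = downtime_budget_minutes(99.95, days)
        print(f"{label}: {minutes:.1f} minutes of allowed downtime")
```

Under these assumptions the budget works out to about 21.6 minutes per 30-day month, or roughly 4.4 hours per year, which gives some idea of how little room the ops side has when the dev side is pushed to ship quickly.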
        
           | fragmede wrote:
           | Paralegals, legal clerks; sales engineers; product
           | management, project management, TPMs; HR coordinator, HR
           | administrative assistant...
           | 
           | I'm not sure your examples say we don't need DevOps!
        
             | zsoltkacsandi wrote:
             | > Paralegals, legal clerks; sales engineers; product
             | management, project management, TPMs; HR coordinator, HR
             | administrative assistant
             | 
             | The point wasn't that there are multiple job titles in
             | different fields.
             | 
             | > I'm not sure your examples say we don't need DevOps!
             | 
             | The main assumption of DevOps is that the friction between
             | developers and operations is the main cause of slow
             | delivery.
             | 
             | Well, this friction can happen between dev and legal, and
             | dev and sales, and dev and product, and ops and product,
             | and ops and sales, and ops and finance, etc.
             | 
             | The companies that need DevOps have a much more deeply
             | rooted problem: the lack of collaboration between any
             | departments (not just dev and ops). Therefore DevOps won't
             | solve their problems. And companies that don't have this
             | problem won't need DevOps, because they don't have problems
             | with collaboration.
        
           | solatic wrote:
           | > DevOps is a flawed concept... conflicting interests... the
           | conclusion is that they just don't want to
           | collaborate/communicate.
           | 
           | Wrong! This tendency is exactly what DevOps explains is the
           | natural state of affairs without DevOps principles. The
           | solution that DevOps advocates for is that such conflicting
           | interests must not be expressed in meetings (where the
           | culture conflict ensures they will get nowhere) but rather
           | expressed in code. Infrastructure must be in code, deploys
           | must be in code, testing must be in code, builds must be in
           | code, policy must be in code, and the implicit pipeline with
           | all the handoffs between teams connecting them all must also
           | be in code. This makes everything fast (at least in
           | comparison to manual processes), and makes everything
           | explicit (in code) so that people can reach outside of their
           | natural organizational silos to propose changes elsewhere in
           | the pipeline, i.e. infra-focused engineers can add failing
           | tests to prove the existence of a bug, developers can add
           | infra that they need, QA can increase infra resources to
           | ensure that sufficient resources are available for expected
           | scale.
           | 
           | The problem most organizations have is that they're not
           | actually willing to force everyone's concerns to be written
           | in code, and people are forbidden from reaching outside their
           | silos. Usually this is due to poor hiring and training
           | practice, e.g. "QA doesn't know how to write code" or "we
           | can't let developers touch security policy". Sometimes it is
           | due to leadership itself misunderstanding DevOps ("developers
           | are forbidden from touching production").
           | 
           | > If DevOps was a real thing there would be a need for
           | LegalOps, DevSales, DevProduct, HROps, etc.
           | 
           | There _is_ such a need, and it is generally being fulfilled
           | by the systems in place. Small example - HROps would be
           | ensuring that changes in the organization (i.e. people moving
           | teams) accurately result in the proper loss (of old) and gain
           | (of new) privileges. This is done with integration between HR
           | systems (the system of record as to who reports to whom and
           | what their responsibilities are) and Active Directory or
           | Google Workspace / Google Groups, ensuring that people are
           | automatically moved between the groups to which permissions
           | are granted.
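
The HR-to-directory integration described in the comment above can be sketched as a small reconciliation step. This is only an illustration under stated assumptions: `GroupChange`, `plan_group_changes`, and the three dictionaries are hypothetical names, and the code deliberately calls no real Active Directory or Google Workspace API. It only computes which group memberships would need to change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupChange:
    """Planned membership change for one user (hypothetical structure)."""
    user: str
    add: frozenset[str]
    remove: frozenset[str]


def plan_group_changes(
    hr_teams: dict[str, str],
    team_groups: dict[str, set[str]],
    current: dict[str, set[str]],
) -> list[GroupChange]:
    """Compare the HR system of record against current group membership.

    hr_teams:    user -> team, as exported from the HR system (assumed shape)
    team_groups: team -> groups that team's members should belong to
    current:     user -> groups the user belongs to right now
    """
    changes = []
    for user in sorted(hr_teams.keys() | current.keys()):
        team = hr_teams.get(user)
        # Users missing from HR (e.g. leavers) should end up in no groups.
        desired = team_groups.get(team, set()) if team is not None else set()
        actual = current.get(user, set())
        add, remove = desired - actual, actual - desired
        if add or remove:
            changes.append(GroupChange(user, frozenset(add), frozenset(remove)))
    return changes


if __name__ == "__main__":
    hr = {"alice": "payments", "bob": "search"}
    groups = {"payments": {"payments-dev"}, "search": {"search-dev"}}
    now = {"alice": {"search-dev"}, "bob": {"search-dev"}}
    for change in plan_group_changes(hr, groups, now):
        print(change.user, "add:", sorted(change.add), "remove:", sorted(change.remove))
```

In this example alice has moved from search to payments, so the plan gains her the new team's group and drops the old one, while bob is left untouched. Separating the plan from its application is a design choice of the sketch: the diff can be reviewed or logged before anything is changed in the directory.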
        
             | zsoltkacsandi wrote:
             | > Wrong! This tendency is exactly what DevOps explains is
             | the natural state of affairs without DevOps principles. The
             | solution that DevOps advocates for is that such conflicting
             | interests must not be expressed in meetings (where the
             | culture conflict ensures they will get nowhere) but rather
             | expressed in code.
             | 
             | Yes, there is the theory, principles, advocates, etc, etc.
             | And there is the reality, and based on many, many years of
             | experience as a dev, as an ops, and as a manager at
             | companies from small to enterprise size, the reality isn't
             | even close to this.
             | 
             | > Infrastructure must be in code, deploys must be in code,
             | testing must be in code, builds must be in code, policy
             | must be in code, and the implicit pipeline with all the
             | handoffs between teams connecting them all must also be in
             | code.
             | 
             | This is not DevOps, and you don't need DevOps for this.
             | This is just about having an engineering mindset.
             | 
             | > so that people can reach outside of their natural
             | organizational silos to propose changes elsewhere in the
             | pipeline
             | 
             | Do you know how many times I have seen a developer touching
             | Terraform code, Ansible playbooks, or pipelines described
             | as code? I am not saying that it never happened, but it was
             | a rare occasion.
             | 
             | > The problem most organizations have is that they're not
             | actually willing to force everyone's concerns to be written
             | in code
             | 
             | I managed such an enforcement and change. It did not solve
             | the cultural and collaboration issues.
             | 
             | > HROps would be ensuring that changes in the organization
             | (i.e. people moving teams) accurately results in proper
             | loss
             | 
             | This is just a matter of SOPs and workflows. It has nothing
             | to do with the topic.
             | 
             | No offense, but your comment is a perfect reflection of why
             | DevOps is a flawed concept. You are talking about
             | enforcement of everything described as code, advocates,
             | principles, etc.
             | 
             | If there is a good culture and collaboration, the
             | infrastructure as code, the advocates, etc will come
             | naturally. People will find a way to collaborate. But not
             | the other way around: these won't fix the culture.
        
             | nijave wrote:
             | While reaching across silos seems good, in theory, my
             | experience is there's just too much breadth and domain
             | knowledge for it to work consistently.
             | 
             | Sure I know application code and have worked with a handful
             | of frameworks, but if I'm enforcing infrastructure or
             | performance concerns and implementing across a handful of
             | different services, it's extremely time consuming getting
             | up to speed in each repo and understanding the subtleties
             | and patterns between each one.
             | 
             | I can optimize queries and debug performance issues but the
             | usual roadblock is understanding what the code is
             | _supposed_ to do and whether an optimization provides the
             | correct results (which is not always clear from tests,
             | assuming good ones exist)
        
       | qwertox wrote:
       | SRE == Site Reliability Engineering.
       | 
       | Quoting Wikipedia:
       | 
       | Site Reliability Engineering (SRE) is a discipline in the field
       | of Software Engineering that monitors and improves the
       | availability and performance of deployed software systems, often
       | large software services that are expected to deliver reliable
       | response times across events such as new software deployments,
       | hardware failures, and cybersecurity attacks[1]. There is
       | typically a focus on automation and an Infrastructure as code
       | methodology. SRE uses elements of software engineering, IT
       | infrastructure, web development, and operations[2] to assist with
       | reliability. It is similar to DevOps as they both aim to improve
       | the reliability and availability of deployed software systems.
       | 
       | https://en.wikipedia.org/wiki/Site_reliability_engineering
        
         | tetris11 wrote:
         | Thank you. Ridiculous that every other acronym was defined
         | except the one in the title..
        
         | doublerabbit wrote:
         | Modern day SysAdmin.
         | 
         | SysOp > SysAdmin > SRE
         | 
         | No different to what I've been doing for the past 15 years. Web
         | 2.0 needed a new buzzword is all.
         | 
         | SREs are System Admins who come from a development
         | background.
         | 
         | System Admins come from a System Operator background.
         | 
         | Uptime is easy when people actually listen to what I have to
         | say or listen to NetOps.
         | 
         | Rather than DevOps throwing $next technology at everything or
         | "needing" 100x more X because their codebase lacks.
        
           | jmillikin wrote:
           | SREs are programmers who specialize in writing programs that
           | manage complex distributed systems.
           | 
           | If you hire SREs and have them doing sysadmin work, then (1)
           | you're massively over-paying and (2) they'll get bored and
           | leave once they find a role that makes better use of their
           | skills.
           | 
           | If you hire sysadmins for SRE work, they'll get lost the
           | first time they need to write a kernel module or design a
           | multi-continent data replication strategy.
        
             | doublerabbit wrote:
             | I stand (ish) corrected. It feels like the same difference
             | as that of a Senior Sysadmin. I do both; I wouldn't call
             | myself an SRE.
        
             | dijit wrote:
             | > If you hire sysadmins for SRE work, they'll get lost the
             | first time they need to write a kernel module or design a
             | multi-continent data replication strategy.
             | 
             | Ah yes, the old (incorrect) mantra of "sysadmins couldn't
             | code". Which is ironic, as the vast majority of the
             | abstractions that you'll interface with are written by
             | sysadmins.
        
               | qwertox wrote:
               | IDK, writing things like kernel modules to improve the
               | reliability of a complex system doesn't really sound like
               | a task sysadmins get paid for.
               | 
               | Yes, a lot of coding (mostly in scripting languages) is
               | normal, mostly to automate tasks and improve visibility
               | into the system, to make data digestible for tools like
               | Grafana, but other optimizations seem to be out of
               | bounds.
               | 
               | But I most likely do lack the insights you have.
        
               | dijit wrote:
               | I've written kernel code to do various anti-ddos stuff,
               | however it's the exception for sure.
               | 
               | Debugging complex systems is _more_ in the wheelhouse of
               | sysadmins. When I came up it was a requirement for
               | sysadmins to be proficient in C, a command-line debugger
               | (usually gdb), the Unix/Linux syscall interface
               | (understanding everything that comes out of strace, for
               | example) and Perl.
               | 
               | Usually those Perl scripts ended up becoming an
               | orchestration/automation platform of some kind. Ruby
               | replaced Perl at some point. I guess it's Python and Go
               | now?
               | 
               | The modern "kernel module" requirement is more likely to
               | be a kubernetes operator or terraform module, and the
               | modern day sysadmin definitely writes those (the rest of
               | the role is essentially identical, just tools got better)
        
       | rzz3 wrote:
       | These days I wonder if Google is really the example to follow.
       | There was a time 10 or 15 years ago where Google seemed to be
       | leading the industry in everything, and I feel like a lot of
       | people still think they do when it comes to engineering culture.
       | These days I tend to see Google as a bit of a red flag on a
       | resume, and I have a set of questions I ask to make sure they
       | didn't drink too much of the koolaid. Perhaps more importantly,
       | when I look at Google from the outside these days, I see that
       | their products have really gone downhill in terms of quality. I
       | see Google Search riddled with spam, I see Gemini struggling to
       | keep up with OpenAI, Google Chat trying to keep up with Slack but
       | missing the mark, Nest being stagnant, I could go on and on. All
       | this to say that I don't think Google is the North Star that it
       | used to be in terms of guiding engineering culture throughout the
       | industry.
        
         | dehrmann wrote:
         | I agree on the product and customer service front, but Google's
         | reliability is top-notch.
        
           | mirashii wrote:
           | As a Google Cloud customer, I'd say it might be best to split
           | Google into some divisions or something, as Google Cloud's
           | reliability is a relative shitshow compared to Google.com.
        
         | pphysch wrote:
         | Who is then?
        
           | scarface_74 wrote:
           | From a product standpoint every BigTech company has done
           | better at releasing new products than Google.
        
             | pphysch wrote:
             | How did we get from SRE culture to (paraphrasing) "I
             | personally think Google makes worse products than IBM,
             | Oracle, Apple, Netflix, Broadcom, et al."
        
               | scarface_74 wrote:
               | Having good technology and good products are orthogonal.
               | People are conflating the two
        
               | hollowsunsets wrote:
               | What defines a good product? Something that many
               | customers use? Something that makes shareholders happy?
        
               | scarface_74 wrote:
               | A product that either moves the needle as far as revenue
               | and/or makes the ecosystem better. It also needs to be a
               | product that gets continuously better as long as there is
               | a market for it and not abandoned quickly.
               | 
               | - "a connected TV device". How many cancelled lines of
               | products have they abandoned? How many market failures
               | have they had in their own line of phones? The Pixels
               | aren't taking the world by storm and they spent billions
               | on Motorola and then sold it off for scraps
               | 
               | They have been releasing and cancelling their own tablet
               | initiatives for years.
               | 
               | At one point they had 5 separate messaging initiatives
               | going on simultaneously.
               | 
               | Even today they have three operating system initiatives
               | that are not on the same codebase - Fuchsia, Android and
               | ChromeOS.
               | 
               | They have basically abandoned Flutter and don't use it
               | for any of their high profile apps.
               | 
               | What have they actually done besides ads?
               | 
               | And the obvious evidence is their money losing "other
               | bets"
               | 
               | Also Google Fiber
               | 
               | https://www.spglobal.com/marketintelligence/en/news-
               | insights...
        
         | jofla_net wrote:
         | this came to mind
         | 
         | https://www.spiceworks.com/tech/data-management/news/google-...
        
           | sanj wrote:
           | Fixed about a week later:
           | 
           | https://support.google.com/drive/thread/245861992/drive-
           | for-...
        
             | znpy wrote:
             | it shouldn't have happened in the first place.
        
         | taeric wrote:
         | I'm curious what you have in mind for evidence of "koolaid"
         | there?
         | 
         | Hard to disagree with the general trend you are outlining.
         | Most of that feels driven by product choices, more so than
         | execution. I think a lot of the previous glorification of their
         | work was likely misguided, as well. But I would be hard pressed
         | to be quantitative on that.
        
           | scarface_74 wrote:
           | It was 5 years after Android was introduced that the CEO
           | stopped using BlackBerry...
        
             | taeric wrote:
             | An amusingly good quantification of some evidence. Well
             | done! :D
             | 
             | Still, I don't have much to say on whether I think the
             | engineering was overly good or bad. I typically think that
             | what they captured for a short while, at least, was
             | enthusiasm. In particular, developers were enthusiastic to
             | be near Google technology in a way that I don't think I've
             | seen for other companies, since.
             | 
             | I don't think they identified it as such, though. Which
             | could be why they seem slow to see that a lot of that has
             | evaporated.
             | 
             | Not to say that they have no enthusiasm, now. I'd wager
             | they still have a lot. But as a percentage share of all
             | developers, it feels very different.
        
         | scarface_74 wrote:
         | I would never hire a _product_ person from Google or someone I
         | needed to be visionary. For the most part, their products suck,
         | they have no vision and no follow through.
         | 
         | But their _technology_ is top notch. I hire mostly for startups
         | and green field initiatives though and I wouldn't hire anyone
         | from any BigTech company unless I had "hard" technical problems
         | to solve.
         | 
         | Yes I've done a stint at BigTech.
        
           | ninkendo wrote:
           | They have top notch tech, yes, but it's massively overkill
           | for literally every company that's not at google's scale. If
           | you're not careful you may hire someone who will try to
           | replicate everything google does, when you may need only 1%
           | of the complexity. This is the experience I've generally had
           | with xooglers... they lament that they don't have the same
           | tools/tech stack they had at google, and so their first act
           | is to try to move everything to the closest open source
           | equivalents, even if they're not a good fit.
           | 
           | There's good things and bad things to take away from
           | experience at google... you have to be careful to ignore
           | things that won't actually help you.
        
             | scarface_74 wrote:
             | I agree. I haven't run into a "hard problem" in my career.
             | 
             | By hard problem I mean one technically in the top 5% of
             | problems in the industry that can't be solved by throwing
             | money at a SaaS or using a cloud provider.
        
               | everfrustrated wrote:
               | Most companies are really just crud apps. Very few are
               | doing anything technically innovative. And that's just
               | fine.
               | 
               | I wonder how much of the early Google technical
               | innovation was more a product of open source
               | tech/distributed systems being a lot more immature (I'm
               | particularly thinking databases) 25 years ago.
               | 
               | Ultimately all companies get bloated and lose their way.
               | It shouldn't be a surprise this has happened to Google -
               | 25 years on they are a mega corp and idling. Probably for
               | the best as it allows innovators a chance to compete.
        
             | deepsun wrote:
             | I've been the "you're not google" person for several years,
             | but now softened my position.
             | 
             | The thing is -- it depends. Sometimes when everyone knows
             | some complex system well -- it becomes easy.
             | 
             | One example comes to mind -- Kubernetes. 90% of teams don't
             | need all its complexity. And I've been "you don't need it"
             | person for some time. But now I see that when everyone
             | knows it -- it's actually much easier to deploy even simple
             | websites on it, because it's a common lingo and you don't
             | spend time explaining how it's deployed.
             | 
             | It's not like civil engineering, where an over-engineered
             | bridge would cost a lot more in materials.
        
               | scarface_74 wrote:
               | If you have a simple website, you can containerize your
               | backend and use much simpler services from AWS and serve
               | your static assets on S3.
               | 
               | Kubernetes is rarely the right answer for simple things
               | even if Docker is.
        
               | mschuster91 wrote:
               | > much simpler services from AWS
               | 
               | Like what, Lambda? I've seen so many horrible hacks and
               | shit done with it (and other AWS services _cough_ API
               | gateway _cough_), these days I rather prefer a set of
               | Kubernetes descriptors and Dockerfiles.
               | 
               | At least that combination all but _enforces_ people doing
               | infrastructure-as-code and there's (almost) no
               | possibility at all for "had to do live hack XYZ in the
               | console and forgot to document it or apply it back in
               | Terraform".
        
               | scarface_74 wrote:
               | AWS App Runner
               | 
               | https://aws.amazon.com/blogs/containers/introducing-aws-
               | app-...
               | 
               | Google has something similar.
        
               | icedchai wrote:
               | GCP has Cloud Run, which looks similar. App Runner is
               | basically a wrapper on top of Fargate, right?
        
               | scarface_74 wrote:
               | Yep. Every "serverless" compute service is just a wrapper
               | on top of Firecracker including Lambda and Fargate.
        
               | icedchai wrote:
               | In my experience, you are better off with ECS/Fargate
               | than Lambda for serving an API. You get much more
               | flexibility.
               | 
               | Also, I've witnessed people editing Lambda code through
               | the console instead of doing a real deploy. what a
               | mess...
        
               | scarface_74 wrote:
               | You can't edit Lambda code in the console when you deploy
               | a Docker image to Lambda.
               | 
               | As far as flexibility, while there have been third party
               | libraries that let you deploy your standard Node/Express,
               | .Net/ASP, Python/Flask app to Lambda, now there is an
               | official first party solution
               | 
               | https://github.com/awslabs/aws-lambda-web-adapter
               | 
               | And as far as ECS, it is stupid simple
               | 
               | I've used this for years
               | 
               | https://github.com/1Strategy/fargate-cloudformation-
               | example/...
        
               | mschuster91 wrote:
               | > Also, I've witnessed people editing Lambda code through
               | the console instead of doing a real deploy. what a
               | mess...
               | 
               | Yeah, exactly that's what I am talking about. Utter
               | nightmare to recover from, especially if whoever set it
               | up thought they needed to follow some uber-complex AWS
               | blog post with >10 involved AWS services and didn't
               | document any of it.
        
               | scarface_74 wrote:
               | You can't edit Lambda code directly in the console when
               | using Docker deployments
        
               | deepsun wrote:
               | Your response is the perfect example of my point. Each
               | time you use "much simpler services" you still _need to
               | explain_ the setup for the simpler services. Someone
               | might know it, someone might not. E.g. some project may
               | eventually grow out of Lambda's RAM limits, but no one
               | on the team knew that. Kubernetes, meanwhile, is a
               | one-size-fits-all setup, even if I don't like it.
               | 
               | And yes, I use Cloud Run myself, but only for my
               | one-person projects. For team projects consistency is
               | much more important (same way to access/monitor/version
               | etc).
               | 
               | PS: I would say even AWS/GCP is already a huge overkill
               | for most projects. But for some reason you didn't see
               | exactly the same problem starting with clouds right away.
        
               | scarface_74 wrote:
               | Lambda can use up to 10GB of RAM and there is also App
               | Runner.
               | 
               | And "using AWS" can be as simple as starting off with
               | Lightsail - a VPS service with fixed pricing
               | 
               | https://aws.amazon.com/lightsail/
        
               | deepsun wrote:
               | RAM is just one example. Every simpler service has its
               | limitations, and if everyone (including new hires) knows
               | the simpler service well -- it's perfect. E.g. in my
               | experience everyone knew App Engine at some point and it
               | worked well for us. Now it's a zoo of devops pieces, so I
               | tolerate Kubernetes only because everyone kinda knows it.
               | 
               | And Kubernetes was just one example of my "you're not
               | google" point. There are many more technologies that are
               | definitely overkill but make a good common denominator,
               | even when they're 1000x more complex than needed for the
               | task at hand.
               | 
               | PS: Btw, I dunno why people downvoted your comment. It
               | fits the HN "Guidelines" at the bottom, so upvoted.
        
               | elktown wrote:
               | The problem is that it can create a chain-reaction of
               | complexity _because_ it opens up possibilities for over-
               | engineering. In the sense of:  "Yes, it's a bit over-
               | engineered, but k8s makes it manageable for us anyway!" -
               | consciously or subconsciously. When I'd often suspect
               | that some restrictions in what's possible/acceptable
               | would've created a significantly leaner overall design in
               | the end.
        
             | VirusNewbie wrote:
             | > but it's massively overkill for literally every company
             | that's not at google's scale.
             | 
             | Before I worked at Google I was at a small telecom company
             | that was running into limits of what some of the
             | Dataflow/Apache Beam product could do, so we had to rewrite
             | it (and commit it back to Beam).
             | 
             | There _are_ companies that have massive scaling issues even
             | if they're not planet scale cloud providers or something.
             | 
             | You can replicate a lot of Google tech now by... just using
             | the OSS they release and/or jumping on a modern cloud
             | provider (GCP or AWS). It's not 2012. You can use a good
             | database and not have to reinvent it.
        
           | jarsin wrote:
           | And now we are all stuck doing leetcode interviews primarily
           | because of Google.
        
             | marssaxman wrote:
             | Leetcode didn't exist back then; the site was founded a
             | little less than a decade ago.
        
               | jarsin wrote:
               | I was using "Leetcode" to mean the style of interview. The
               | Leetcode website was founded because everyone in the
               | industry was copying google/big tech in this style of
               | interview.
        
               | mike_hearn wrote:
               | Google copied that idea from Microsoft, primarily.
        
           | fweimer wrote:
           | The flip side from this assumption about their technology is
           | that if some service is not working, people are very quick to
           | blame the impacted (paying) user. "You are running into rate
           | limits", "Google is applying anti-abuse controls to your
           | account", and so on. But at least for some services, I
           | strongly suspect we are actually experiencing random system
           | failures. In my experience, it's rare to get acknowledgement
           | of this from Google. Tickets may not even make it to them
           | because of this pervasive "Google technology is perfect"
           | assumption. The end result feels a bit like gaslighting
           | (doubting our sanity because we can't spot the pattern that
           | is supposed to be obvious): we are encouraged to attribute
           | meaning to more or less random reactions from a complex
           | system.
        
         | brudgers wrote:
         | Unless you have Google sized problems and resources, Google
         | probably is not the best example because the things Google does
         | are done to address Google size problems with Google sized
         | resources. Its tooling and methods are not commercial
         | products.
         | 
         | For example, Google can get away with the flaws of its AI
         | search results because it is Google.
        
         | jeffbee wrote:
         | The fact that some people prefer ChatGPT over Gemini is not
         | something that SRE can help you with. The fact that ChatGPT is
         | rarely available is something that SRE could help Microsoft
         | avoid.
        
           | lupire wrote:
           | ChatGPT is rarely available??
        
             | jeffbee wrote:
             | They have major, long-lasting incidents at least once a
             | week. https://status.openai.com/
        
         | yodsanklai wrote:
         | > There was a time 10 or 15 years ago where Google seemed to be
         | leading the industry in everything
         | 
         | They used to write interesting books and articles about
         | software engineering. It felt that they were maintaining high
         | quality standards and were an industry reference. Nowadays, I
         | wouldn't go as far as saying it's a red flag to have Google on
         | one's resume, but definitely not the same appeal as before.
        
           | nimish wrote:
           | Goodhart's Law for employment
        
       | jph wrote:
       | The article describes Causal Analysis based on Systems Theory
       | (CAST) which is akin to many-factor root cause analysis.
       | 
       | I am a big fan of CAST for software teams, and of MIT Prof. Nancy
       | Leveson who leads CAST.
       | 
       | My CAST summary notes for tech teams:
       | 
       | https://github.com/joelparkerhenderson/causal-analysis-based...
       | 
       | MIT CAST Handbook:
       | 
       | http://sunnyday.mit.edu/CAST-Handbook.pdf
        
         | pulkitsh1234 wrote:
         | Are there any resources to show how to apply this in practice?
         | This is too theoretical to grok for me, there are too many
         | terms. It seems too time-consuming to understand (and to
         | perform IMO)
        
           | jph wrote:
           | > This is too theoretical to grok for me
           | 
           | Here's a fast, easy, practical way to think about CAST:
           | 
           | 1. Causal: Novices may believe accidents are due to one "root
           | cause" or a few "probable causes", but it turns out that
           | accidents are actually due to many interacting causes.
           | 
           | 2. Analysis: Novices may blame people, but it's smarter to do
           | blame-free examination of why the loss occurred, and how it
           | occurred i.e. "ask why and how, not who".
           | 
           | 3. Systems: Novices may fix just one thing that broke, but it
           | turns out it's better to discover multiple causes, then plan
           | multiple ways to improve the whole system.
        
             | m0nkee wrote:
             | like your points
        
         | materielle wrote:
         | I was listening to a Titus Winters podcast, and I'm not sure he
         | exactly put it like this, but I took it away as:
         | 
         | There are two problems with automated testing: 1) tests take
         | too long to run, and 2) it's difficult to root-cause breakages.
         | 
         | Most devs solve this by making unit tests ever more granular
         | with heavy use of mocks/fakes. This "solves" both problems in a
         | narrow sense: the tests run faster and breakages are obvious to
         | root-cause.
         | 
         | But you didn't actually solve the problem. Since the entire
         | point of writing tests in the first place was to answer the
         | question: "does my system work"? Granular and mocked unit tests
         | don't help much.
         | 
         | However, going back to the original question, we can actually
         | reframe the problems as: 1) a work scheduling problem and 2) a
         | signal processing problem.
         | 
         | Those are pretty well understood problems with good solutions.
         | It's just that this is a somewhat novel way of thinking of
         | tests, so it hasn't really been integrated into the open source
         | tool chain.
         | 
         | You could imagine integration tests automatically being
         | correlated to a micro service release, or some CI automation
         | constantly running expensive tests over a range of commits
         | and automatically bisecting on failure. Etc.
         | 
         | Put another way, automated tests don't go far enough. We need
         | yet another higher layer of abstraction. Computers are better
         | at deciding what tests to run and when, and are also better at
         | interpreting the results.
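         | 
         | A minimal sketch of the "bisect on failure" part, assuming
         | commits are ordered oldest to newest and that a breakage
         | persists once introduced. The names first_bad_commit and
         | run_expensive_test are hypothetical, not taken from any real
         | CI tool:

```python
from typing import Callable, Optional, Sequence


def first_bad_commit(
    commits: Sequence[str],
    passes: Callable[[str], bool],
) -> Optional[str]:
    """Return the earliest failing commit, or None if the newest one passes.

    Assumes commits are ordered oldest to newest and that every commit
    after the breaking one also fails (no flaky or self-healing tests).
    """
    if not commits or passes(commits[-1]):
        return None
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] fails
    while lo < hi:
        mid = (lo + hi) // 2
        if passes(commits[mid]):
            lo = mid + 1  # breakage was introduced after mid
        else:
            hi = mid  # mid is already broken, so look at or before it
    return commits[lo]


# Hypothetical history in which "c4" introduced the breakage.
history = ["c1", "c2", "c3", "c4", "c5", "c6"]
broken = {"c4", "c5", "c6"}
runs = []  # records which commits the expensive test was run against


def run_expensive_test(commit: str) -> bool:
    runs.append(commit)
    return commit not in broken


culprit = first_bad_commit(history, run_expensive_test)
```

         | For a range of n commits this takes on the order of log2(n)
         | test runs rather than n, which is what would make it
         | affordable to point expensive tests at whole commit ranges.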
        
           | azurelake wrote:
           | > Put another way, automated tests don't go far enough. We
           | need yet another higher layer of abstraction. Computers are
           | better at deciding what tests to run and when, and are also
           | better at interpreting the results.
           | 
           | Sounds like you might be interested in
           | https://antithesis.com/ (no affiliation).
        
         | typesanitizer wrote:
         | Thanks for writing the summary notes and sharing those here.
         | After reading the Usenix article, I was thinking that we could
         | apply some of the ideas at $WORK, but the exact "How" was still
         | not super clear. Your notes offer a compact and accessible
         | starting point without having to ask colleagues to dive into a
         | 100+ page PDF. :D
        
       | MPSimmons wrote:
       | This reminds me very much of Sidney Dekker's work, particularly
       | The Field Guide to Understanding 'Human Error', and Drift Into
       | Failure.
       | 
       | The former focuses on evaluating the system as a whole, and
       | identifying the state of mind of the participants of the
       | accidents and evaluating what led them to believe that they were
       | making the correct decisions, with the understanding that nobody
       | wants to crash a plane.
       | 
       | The latter book talks more about how multiple seemingly
       | independent changes to complex loosely coupled systems can
       | introduce gaps in safety coverage that aren't immediately
       | obvious, and how those things could be avoided.
       | 
       | I think the CAST approach looks appealing. It seems as though it
       | does require a lot of analysis of failures and near-misses to be
       | best utilized, and the hardest part of implementing it will
       | undoubtedly be the people, who often take the "there wasn't a
       | failure, why should we spend time and energy investigating a
       | success" mindset.
        
         | jph wrote:
         | Yes you're 100% right. Dekker is a valuable complement to CAST
         | & STAMP because Dekker emphasizes people aspects of psychology,
         | goals, beliefs, etc., while CAST emphasizes engineering aspects
         | of processes, practices, metrics, etc.
         | 
         | CAST describes how to pragmatically bring together the people
         | aspects and the engineering aspects, by having stakeholders
         | write a short explicit safety philosophy:
         | 
         | https://github.com/joelparkerhenderson/safety-philosophy
        
       | FuriouslyAdrift wrote:
       | I think the single biggest thing about Google SREs (at least in
       | the early years) was that if your team was going to launch a new
       | product, you had to have an SRE to help and to maintain the
       | service.
       | 
       | Google deliberately limited the number of SREs, so you had to
       | prove your stuff worked and sell it to the SRE to even get a
       | chance to launch.
       | 
       | Constraints help to make good ideas better...
        
         | hollowsunsets wrote:
         | It's not good when you have an SRE on hand to act as a
         | babysitter of sorts. That is how some companies use SREs these
         | days. They do the toil and sysadmin work so the product
         | engineers can focus on features. Exactly what we hoped to
         | avoid, but here we are.
        
           | sgarland wrote:
           | If by some you mean nearly all, then yes, and yes, it's
           | terrible.
           | 
           | Super fun being the adult in the room having to explain for
           | the millionth time why someone can't expect that a network
           | call will always succeed, and will never experience latency.
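           | 
           | One hedged sketch of treating failure as the normal case:
           | retry with exponential backoff. call_with_retries and
           | flaky_fetch are hypothetical names, and the sleep parameter
           | is injectable only so the example runs instantly:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on OSError with exponential backoff.

    ConnectionError and TimeoutError are both subclasses of OSError.
    The last error is re-raised once the attempt budget is spent, so
    the caller still has to decide what a failed call means.
    """
    for attempt in range(attempts):
        try:
            return call()
        except OSError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
    raise ValueError("attempts must be at least 1")


# Hypothetical call that fails twice before succeeding.
outcomes = [ConnectionError("reset"), TimeoutError("slow"), "ok"]
delays = []  # backoff delays that would have been slept


def flaky_fetch() -> str:
    result = outcomes.pop(0)
    if isinstance(result, Exception):
        raise result
    return result


value = call_with_retries(flaky_fetch, attempts=3, sleep=delays.append)
```

           | Retries only cover the "will always succeed" half. The
           | latency half still needs a timeout on the call itself, and
           | retrying is only safe when the call is idempotent.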
        
           | nvarsj wrote:
           | Even at Google it's like this. I spent the holidays watching
           | my on-call Google SRE friend trying to diagnose misbehaving
           | mapreduce jobs devs had written. They are basically glorified
           | first line support so SWEs don't get woken up in the night.
           | 
           | Which seems like the worst possible setup to me - devs should
           | be first on call for code they write. That seems like a basic
           | principle to me and creates the correct incentives.
        
         | arthurjj wrote:
         | Thanks for this detail, I worked at Google, with SREs, and
         | didn't know it. It seems like the type of 'design' detail that
         | might be more important than this entire article
        
         | emtel wrote:
         | This culture was, imo, directly responsible for google's
         | failure to launch a facebook competitor early enough for it to
         | matter.
         | 
         | The Orkut project was basically banned from being launched or
         | marketed as an official google product because it was deemed
         | "not production ready" by SRE. Despite that it gained huge
         | market share in Brazil and a few other countries before
         | eventually losing to FB. By the time their "production ready"
         | product (G+) launched it was hilariously late.
         | 
         | Facebook probably would have won anyway, but who knows what
         | might have happened if Google had actually leaned into this
         | very successful project instead of treating it like an unwanted
         | step-child.
        
           | mike_hearn wrote:
           | How was it banned from being launched? It did launch and the
           | desire to not be promoted as a Google product came from Orkut
           | himself, iirc.
           | 
           | The reason it was not regarded as 'production ready' was that
           | the architecture didn't scale. In fact it also didn't run on
           | the regular Google infrastructure that everything else used
           | and that SRE teams were familiar with; it was a .NET app that
           | used MS SQL Server.
           | 
           | This design wasn't a big surprise. Facebook won not because
           | Orkut lost but because Facebook were the first to manage
           | gradual scaleup without killing their own adoption, by
           | figuring out naturally isolated social networks they could
           | restrict signup to (American universities). This made their
           | sharding problem much easier. Every other competitor tried to
           | let the whole world sign up simultaneously whilst also
           | offering sophisticated data model features like the ability
           | to make arbitrary friendships and groups, which was hard to
           | implement with the RDBMS tech of the time.
           | 
           | Orkut did indeed suffer drastic scaling problems and
           | eventually had to be rewritten on top of the regular Google
           | infrastructure, but that just replaced one set of problems
           | with another.
        
             | ghaff wrote:
             | Of course restricting rollout to American university emails
             | (including alumni addresses--at least at one point) was
             | also a pretty natural consequence of Facebook's origins.
        
             | emtel wrote:
             | The attitude within SRE toward Orkut (the product) was one
             | of disdain if not contempt. A healthy culture does not
             | treat rapidly growing products this way.
        
               | mike_hearn wrote:
               | I mean, I'm personal friends with a former Orkut SRE. The
               | idea that Google SRE ignored or disdained Orkut just
               | isn't right. Nonetheless, if your job is defined as "make
               | the site reliable" and it's written in a way that can
               | never be reliable then it's understandable that you're
               | going to have at least some frustrations.
        
         | philll wrote:
         | That sounds dysfunctional.
        
       | crabbone wrote:
       | I wish this article was at most a quarter of its current length.
       | Preferably even shorter. There's so much self-congratulatory and
       | empty talk, it's really hard to get to the main point.
       | 
       | I think the most important (and actually valuable) part is the
       | mention of the work done by someone else (STPA and CAST). That's
       | all there is to the article. Read about Causal Analysis based on
       | Systems Theory (CAST) and System-Theoretic Process Analysis
       | (STPA) and do what the book says.
        
         | anal_reactor wrote:
         | Agreed that the whole article could've been much shorter.
         | Anyway, for me the key takeaway is not to trust your inputs.
         | It's true that code correctness often boils down to "given
         | input X, the program will correctly give output Y", but the
         | actual issue is that sometimes the input X itself might be
         | wrong. I think it's clearly visible in project management,
         | where people tell you one thing, you plan accordingly, then
         | later they do another thing, and if you haven't predicted this,
         | you're done. If this behavior is so common in human projects in
         | general, I see no reason why it wouldn't emerge in software
         | projects too.
         | 
         | The problem is, software that tries to do something smart with
         | inputs is much harder to reason about, which in turn increases
         | your likelihood of failure, which is exactly the thing you
         | wanted to avoid in the first place. For example, imagine you
         | have an edge case in your script where you want to perform "rm
         | -rf /" but the safety mechanism prevents you from doing this,
         | which effectively makes your script fail.
         | 
         | In conclusion, in my humble opinion, the most important part of
         | safety is choosing tools that are simplest to reason about. If
         | you have a bash script you're guaranteed to have some bug
         | related to some edge case - people managing POSIX realized that
         | bash is so fundamentally broken that it's better to forbid
         | certain filenames rather than fix bash. Use a Python library
         | for 10x the safety but half the comfort. If you have a C++
         | program it will leak memory no matter how hard you try. And so
         | on.
         | 
         | Similarly, when writing programs, you should give simple and
         | strong promises about their APIs. Don't ever do "program accepts
         | most sensible date strings and tries to parse that", do "it's
         | either this specific format or an error".
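         | 
         | A minimal sketch of the strict variant, assuming the contract
         | is exactly YYYY-MM-DD. parse_date is a hypothetical helper,
         | not part of any library:

```python
import re
from datetime import date, datetime

# Assumed contract: four-digit year, two-digit month, two-digit day.
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_date(text: str) -> date:
    """Parse text as YYYY-MM-DD or raise ValueError; never guess."""
    if not _ISO_DATE.fullmatch(text):
        raise ValueError(f"expected YYYY-MM-DD, got {text!r}")
    # strptime still rejects impossible dates such as 2025-02-30.
    return datetime.strptime(text, "%Y-%m-%d").date()


parsed = parse_date("2025-01-03")
```

         | The explicit pattern check is there because strptime on its
         | own accepts unpadded fields such as "2025-1-3", which is
         | already a step towards "most sensible date strings".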
         | 
         | Verifying inputs and being smart about them is a good idea that
         | should be used carefully because it can backfire spectacularly.
        
         | herodoturtle wrote:
         | Came here to say this.
         | 
         | It's not the first article / publication on Google SRE I've
         | read, and they're all similarly (and imho unnecessarily)
         | verbose.
         | 
         | Whilst I'm deeply grateful to the good folks at Google for
         | sharing their hard-earned knowledge with us, I do wish their
         | publications on this important topic were far more succinct.
        
       | cudgy wrote:
       | Article is about an acronym and yet never states what the acronym
       | SRE means.
        
         | packetslave wrote:
         | Not everything needs to be spelled out for people who are too
         | lazy to Google
        
         | n0n0n4t0r wrote:
         | It does... if you read the authors' bio in the footer :)
        
       | abotsis wrote:
       | Couple thoughts here:
       | 
       | 1. The "rightsizer" example mentioned might well have had the
       | same outcome if the outage was analyzed in a "traditional"
       | sense. That said, it is much easier and more actionable with
       | this new approach.
       | 
       | 2. I've always hated software testing because faults can occur
       | external to the software being tested. It's difficult to reason
       | about those if you have a myopic view of just your component of
       | the system. This line of thinking somewhat fixes that - or at
       | least paves a path to fixing that.
       | 
       | Unfortunately, while this article says a lot, much of it just
       | repeats itself and I wish there was more detail. For example: who
       | all is involved in this process? Are there limits on what can be
       | controlled? How (politically) does this all shake out with
       | respect to the relationships between SREs and software engineers?
       | Etc.
        
         | wilson090 wrote:
         | Agreed, the devil is in the detail for SRE functions, and the
         | organizational details of how to leverage this framework are
         | largely absent from this writeup. With so many teams struggling
         | to get the organizational components right just for traditional
         | SRE (due to budget constraints, internal politics,
         | misunderstanding of SRE by leadership, etc), I'd imagine
         | implementing the changes needed to leverage the ideas in this
         | writeup will be impossible for all but extremely deep-pocketed
         | tech companies.
         | 
         | Nonetheless, lots of interesting concepts, so I would like to
         | see a Google SRE handbook style writeup with more info that
         | might be of more practical value.
        
       | cmckn wrote:
       | SWEs: are SRE/devops folks part of your day to day?
       | 
       | I have never been in a SWE role where I didn't do my own "ops",
       | even at FAANG (I haven't worked at Google). I know "SRE/devops"
       | was/is buzzy in the industry, but it's always seemed, in the vast
       | majority of cases, to be a half-assed rebrand of the old school
       | "operations" role -- hardly a revolution. In general, I think my
       | teams have benefited from doing our own ops. The software
       | certainly has. Am I living in a bubble?
        
         | fragmede wrote:
         | I'm assuming in the ops side of your role, you're not filling
         | in firewall rules paperwork to a network team, spinning up new
         | servers to SSH in and SCP some files over and edit a couple of
         | config files though. Operations just doesn't look like that
         | anymore, so the fact that SWE teams can now do a meaningful
         | amount of operations for their product _is_ the revolution. It
         | may not feel like it if you weren't doing operations the old
         | way, but there are a lot of invisible tools making things
         | work.
        
         | baalimago wrote:
         | SRE and DevOps are better summarized as 'cloud engineering',
         | IMO. Basically, it's to set up and maintain the infrastructure
         | which allows you to do your own ops as a dev.
        
           | cmckn wrote:
           | That's my impression as well. My SWE team has always done all
           | of that ourselves, I've never felt the need for a dedicated
           | role to maintain IaC and click around in the console.
        
       | n0n0n4t0r wrote:
       | I wonder at what scale this very interesting approach starts
       | yielding more value than cost. What I mean is: is it faang-only,
       | like so many things they seeded, or is it actually relevant at a
       | non-faang scale?
       | 
       | I tend to invest a lot in risk avoidance, so this is appealing
       | to me, but I know that my risk avoidance tendency is rarely
       | shared by my peers/stakeholders.
        
         | mgaunard wrote:
         | The example here seems to be about appropriately sizing the
         | requirements of applications, which enables you to schedule
         | more applications per machine, driving down costs.
         | 
         | This is useful for any company larger than say 10 people.
         | 
         | In general this is difficult to do, because there is more at
         | play than memory, CPU and disk usage, especially if you have
         | certain performance requirements.
         | 
         | I find what's in Kubernetes (a Google product) pretty much
         | useless, but maybe it works for web tech.
        
           | n0n0n4t0r wrote:
           | I understood their example more like: automating the scaling
           | of servers is easy, having proper inputs for this scaling to
           | be reliable is hard.
           | 
           | What they propose is to spend weeks of engineering time
           | performing analysis in the hope of finding some related
           | issues. Are both this engineering time and the issue-fixing
           | time worth it for non-faang companies?
           | 
           | In other words: the leverage gained from not having issues
           | is smaller, so the payoff of such analysis decreases. Where
           | does the payoff become negative?
        
             | mgaunard wrote:
             | In practice it's pretty much impossible to get precise
             | requirements without automatically learning them from how
             | the application performed in the past.
             | 
             | The problem is that it is high-risk to automatically
             | perform those changes since they might affect the
             | application in ways you do not expect.
        
               | n0n0n4t0r wrote:
               | I really don't think they are talking about requirements,
               | at least not specifically. Aren't you focusing on issues
               | at your own level?
        
               | mgaunard wrote:
               | The example they gave is the quota rightsizer. Its job is
               | to infer the right quota (requirement allocation).
        
               | n0n0n4t0r wrote:
               | Yes, but I mean that they shifted the focus to the input
               | measurements rather than the correct quota value.
        
         | mike_hearn wrote:
         | It's definitely something that requires a high budget and a
         | dedicated reliability team. In most orgs that have got as far
         | as a proper post-mortem and analysis culture, they aren't even
         | reliably draining the action items generated by the
         | post-mortems, so attempting to pre-emptively generate action
         | items is kind of a moot point.
        
       | lamontcg wrote:
       | > Looking at a data flow diagram with more than 100 nodes is
       | overwhelming--where do you even begin to search for flaws?
       | 
       | Yeah, so maybe try not to build anything that complex to start
       | with.
        
         | __turbobrew__ wrote:
           | Yeah, that was my takeaway too. Maybe limit the depth of any
           | RPC call?
        
           | lamontcg wrote:
           | Also, just don't try to be Google and don't microservice the
           | crap out of everything...
           | 
           | The problem is that too many people in our industry are
           | trying to get experience to land a job at Google so they try
           | to turn every job into Google...
           | 
           | Although honestly I suspect that Google could do more to
           | simplify internally, but that is the kind of work that
           | doesn't get you promoted, while layers of additional smart-
           | sounding complexity do.
           | 
           | And it really sounds to me like we've gone wrong as an
           | industry, where you can't bolt together lego blocks and get
           | working larger systems out of them, and have to worry about
           | large scale spooky-action-at-a-distance effects. It is like
           | having to worry about the interaction of your radio with your
           | car's drive train. A simple fuse keeps the radio from killing
           | the engine, and then the designer of the engine never has to
           | think about the radio.
        
       | ashepp wrote:
       | I've been reading about CAST (Causal Analysis based on Systems
       | Theory) and noticed some interesting parallels with mechanistic
       | interpretability work. Rather than searching for root causes,
       | CAST provides frameworks for analyzing how system components
       | interact and why they "believe" their decisions are correct -
       | which seems relevant to understanding neural networks.
       | 
       | I'm curious if anyone has tried applying formal safety
       | engineering frameworks to neural net analysis. The methods for
       | tracing complex causal chains and system-level behaviors in CAST
       | seem potentially useful, but I'd love to hear from people who
       | understand both fields better than I do. Is this a meaningful
       | connection or am I pattern-matching too aggressively?
        
         | triclops200 wrote:
         | I do AI/ML research for a living (my degrees were in
         | theoretical CS and AI/ML and my [unfinished] PhD work was in
         | computational creativity [essentially AGI]). I also do SRE work
         | for a living.
         | 
         | And yeah, that's a useful way of characterizing some of the
         | behaviors of some kinds of neural networks. There's a point at
         | which the distinction between belief and "frequency (or
         | probability-amplitude) state filter" becomes less apparent,
         | though; that's more of a function-of-medium vs.
         | function-of-system distinction.
         | 
         | However, systems like these can often become mediums,
         | themselves, for more complex systems. Additionally, a system
         | which has "closed-the-loop" by understanding the medium and the
         | system as coupled as "self" and separate from the environment
         | along with a direction/goal is a pretty decent, if imprecise,
         | definition of a strange loop. Contradiction resolution between
         | internal component beliefs gives a possible (imo, highly
         | probable) mechanistic explanation for the phenomenon of free
         | energy minimization in such systems. External contradiction
         | resolution extends it to active inference.
        
       | georgewfraser wrote:
       | Like so many things from Google engineering this will be toxic to
       | your startup. SREs read stuff like this, they get main character
       | syndrome and start redoing the technical designs of all the other
       | teams, and not in a good way.
       | 
       | This phenomenon can occur in all "overlay" functions, for example
       | the legal department will try to run the entire company if you
       | don't have a good leader who keeps the team in their lane.
        
         | physhster wrote:
         | In my experience, SREs are usually "enforcers of
         | maintainability". If your engineers don't want to be on call,
         | they need to produce applications and services that are
         | documented and maintainable. It's an amazing forcing function.
         | SREs don't often redo technical designs; there's plenty of
         | reliability and scalability work to do...
        
           | jshen wrote:
           | Your engineers should be on call.
        
             | physhster wrote:
             | At a 200-person company, sure. But when you're in the tens
             | or hundreds of thousands, that's a hard no. Especially when
             | dealing with out-of-scope dependencies.
        
               | otterley wrote:
               | I work for a company with millions of employees. Our SDEs
               | and their managers carry and are responsible for
               | answering the pagers. We don't have SREs.
        
         | la64710 wrote:
         | Since the 90s, the whole DNS on which the internet stands today
         | has been run successfully with minimal error by a bunch of
         | folks who used to call themselves sysadmins. Developers seem to
         | have run out of things to develop and have been reinventing
         | themselves as devops and SREs. They have been pushing out pure
         | sysadmins, but at the same time this trend shows how demand for
         | developers or SWEs falls far short of the supply of developers
         | in the market.
        
           | tsss wrote:
           | Take one look at the Kubernetes source code and it becomes
           | clear that you can make successful software with zero clue
           | about good software engineering.
        
       ___________________________________________________________________
       (page generated 2025-01-04 23:01 UTC)