[HN Gopher] Google's SRE Book (2017)
       ___________________________________________________________________
        
       Google's SRE Book (2017)
        
       Author : udev4096
       Score  : 139 points
       Date   : 2024-03-03 12:08 UTC (10 hours ago)
        
 (HTM) web link (sre.google)
 (TXT) w3m dump (sre.google)
        
       | overbytecode wrote:
       | Can any SREs tell us how applicable this book is today? Is it
       | still a useful read?
        
         | oooyay wrote:
         | Yeah, it is, but there's also a lot more to being an SRE than
         | this book. This book more or less tells you how to stand up a
         | reliability program; what it doesn't really indicate is _what
         | SREs do_. A lot of people I meet think SRE is just the new
         | title for "operator", which couldn't be further from the truth.
         | Whether you're doing an embedded model, like is referenced in
         | the book, or you have a central org - both are made up of
         | software and systems software engineers that are focused on
         | performance and reliability. They build software, do analysis,
         | and write policy that improve the bottom line reliability of
         | the organization.
        
         | pm90 wrote:
         | Not an SRE, but I think the main contribution from this book
         | was to popularize terminology of operations (eg SLAs) and to
         | give an opinionated perspective on how to handle operations at
         | scale.
         | 
         | More practically, I don't think the book is as useful, as it
         | generally only makes sense when you reach a certain scale that
         | few organizations ever do (imo).
         | 
         | However, we are heading into a future where computing will be
         | everywhere and sensors in everything so in maybe a decade even
         | the "smallest" of organizations may be responsible for large
         | scale distributed systems and operating that would require
         | concepts that are provided in the book.
        
         | arter4 wrote:
         | As a non Googler myself, it still is if you want to know how to
         | set up an SRE team and introduce SRE (ie good sysadmin, for
         | lack of a better word) best practices. The focus on actual
         | indicators such as SLI and SLO, the importance of reducing
         | "toil" (boring repetitive tasks) and automating: these are
         | all valid concerns.
         | 
         | If you want more about system design and how to design
         | reliability, I suggest reading
         | https://google.github.io/building-secure-and-reliable-system...
        
         | bananapub wrote:
         | yes, but not as a checklist of things you have to do, instead
         | it's a valuable discussion of lots of problems and how they
         | were solved in specific circumstances.
         | 
         | learn from it, don't copy from it.
        
         | Rapzid wrote:
         | The front half is for introducing ideas. The back chapters
         | were never that great IMHO. They get too far into the weeds
         | while at the same time lacking actionable advice.
        
       | jupp0r wrote:
       | Read this, even if you are far away from any actual operation of
       | the systems you work on. Read it especially then.
       | 
       | Learning the principles and philosophy conveyed in that book
       | helped me tremendously in my career (as a software engineer).
       | Thanks people at Google for writing and open sourcing it.
        
         | hrpnk wrote:
         | The important bit to remember when reading is to understand the
         | origin (why) of the concepts. I've seen Engineers being too
         | dogmatic about the book, saying that "Google does it this way"
         | and not being able to apply the concepts to one's own
         | organizational context. Even at Google, there will be different
         | teams who will deviate from the "described process" given their
         | business context, setup, or stakeholders.
        
           | jeffbee wrote:
           | > Even at Google, there will be different teams who will
           | deviate from the "described process" given their business
           | context, setup, or stakeholders.
           | 
           | Seriously. A lot of the book was influenced by Social SRE who
           | had opinions all out of proportion to their own importance
           | and success. At the time, there was some doubt about whether
           | Social's pet theories belonged in the book, considering the
           | varying practices and beliefs of other SRE groups supporting
           | products that people actually use.
           | 
           | This is related to my rule that anybody can title their doc
           | "Best Practices" even if nobody subscribes to them.
        
             | esafak wrote:
             | Can you name a better book?
        
               | jeffbee wrote:
               | None off the top of my head but for the engineering
               | topics in Part III I think it pays to read them as
               | historical background material, then read the last decade
               | of conference papers, talks, and articles.
               | 
               | Example: the section on backend subsetting in distributed
               | systems is not current. If you wanted the current Google
               | practice you need to read "Reinventing Backend Subsetting
               | at Google"[1], and there are other interesting
               | publications from other organizations.
               | 
               | 1: https://queue.acm.org/detail.cfm?id=3570937
        
             | brandall10 wrote:
             | Can you please expound on what Social SRE is? Is it some
             | public handle of an engineer, a name for an internal group,
             | or something else?
        
               | bananapub wrote:
               | the SRE teams that worked around various Google "Social"
               | [media] products
        
       | rozenmd wrote:
       | There's still good advice in the book, but be aware it was
       | published in 2016, with folks likely having started writing it
       | around 2014.
       | 
       | Both Google and SRE/DevOps have advanced greatly since then, and
       | following the book blindly would be cargo culting.
       | 
       | Edit: apparently this is a controversial opinion?
        
         | esafak wrote:
         | What contemporary book do you recommend?
        
         | Rapzid wrote:
         | How has it advanced greatly since then?
        
           | lrem wrote:
           | Most of the tools I was using when my colleagues were
           | writing the book are either gone, or half-forgotten
           | abandonware. The new tools were built for different
           | processes, system layout and organisational structures.
        
             | Rapzid wrote:
             | Book barely talks about tools. It wasn't about tools. The
             | epiphany for many was the concept of an error budget and
             | establishing SLOs. Then, basing investment in reliability
             | on data.
             | 
             | That's as applicable today as it was then.
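
The error budget mentioned in the comment above comes down to simple arithmetic: an SLO target implies a tolerated number of failures over a window, and what is left of that allowance is the data that drives reliability investment. The following is a minimal sketch, assuming a request-based SLI; the function names `error_budget` and `budget_remaining` are hypothetical and do not come from the book or the thread.

```python
def error_budget(slo_target: float, total_events: int) -> int:
    """Number of failed events the SLO tolerates over the window.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    # round() absorbs floating-point noise such as 1000.0000000000009.
    return round((1.0 - slo_target) * total_events)


def budget_remaining(slo_target: float, total_events: int,
                     failed_events: int) -> float:
    """Fraction of the error budget still unspent.

    A negative result means the budget is overspent.
    """
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        # A 100% target (or an empty window) leaves nothing to spend.
        return 0.0
    return 1.0 - failed_events / budget


if __name__ == "__main__":
    # Assumed example numbers: 1,000,000 requests in the window.
    print(error_budget(0.999, 1_000_000))           # 1000
    print(budget_remaining(0.999, 1_000_000, 250))  # 0.75
```

A team that has burned most of its budget would slow releases and spend time on reliability; a team with plenty left can ship faster. The thresholds for those decisions are a policy choice that the sketch deliberately leaves out.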
        
       | softirq wrote:
       | Most companies completely missed the point of SRE/PE/DevOps and
       | keep them on separate teams doing sysadmin toil work and oncall
       | thrown over the wall by engineers who are only concerned with
       | feature deadlines. They regress them back to sysadmin duties and
       | get none of the value of a true SRE program.
       | 
       | SRE should always be a subtitle for a SWE and not a separate
       | position, and they should always be embedded with SWEs into one
       | team either building products or infrastructure. The shared
       | ownership and toil reduction _only_ works if you have these two
       | things.
       | 
       | All this said, I think the regression is also due to the fact
       | that real SREs are rare. A solid SWE that also has deep systems
       | domain knowledge, understanding how to sift through dashboards
       | and live data, and root cause complex performance problems is a
       | master of many domains and is hard to find.
        
         | snowfield wrote:
         | The regression is also due to the fact that a real SRE is
         | expensive. It's cheaper to just get some new grads to react to
         | alarms following a set runbook of what to do if that alarm
         | triggers.
         | 
         | VERY few companies operate at Google's scale. For 99.99% of
         | companies it makes sense to investigate single machine issues.
        
           | bananapub wrote:
           | Google SREs also end up investigating single machine issues,
           | fyi.
        
             | jeffbee wrote:
             | Yes, but At Scale(r)
             | 
             | It's a totally different experience when you have the
             | people who technically own the hardware side of the
             | operations taking no responsibility for the well-being of
             | it, and the people who own the software developing
             | elaborate workarounds for bad machines, and the SREs
             | maintaining blacklists of individual nodes.
        
       | amelius wrote:
       | Does it have a chapter on how to deal with end-user support?
        
         | michaelt wrote:
         | For more information on Google's end-user support, please post
         | on the Community Forum.
        
         | snowfield wrote:
         | I think Google's advice is: don't
        
       | dijit wrote:
       | What I found most fascinating is that Google essentially
       | rediscovered what is _important_ in a sysadmin and codified the
       | contract between feature developers and reliability roles.
       | 
       | Instead of having feature developers feeling like they have no
       | say in operational requirements, and instead of having
       | reliability staff fighting unstable mess: properly making the
       | contract means everyone gets heard.
       | 
       | Contrast that with devops, which, despite coming out later, was
       | in vogue when this book came out, and which muddied the role of
       | sysadmin into meaning either:
       | 
       | * a sysadmin practicing agile (the original definition btw)
       | 
       | * a software engineer with enough OS skills to carry the pager
       | (the popular one), or
       | 
       | * a team consisting of sysadmins _and_ software developers with
       | no barrier between them (10+ deploys per day style).
       | 
       | Everyone had their own definition of DevOps. So when SRE made it
       | clear (sysadmins are needed, stop trying to push everything into
       | one person, here's how we fix the tension between teams), it was
       | a breath of fresh air.
       | 
       | The only revisionist history (that even Google seems to forget)
       | is that sysadmins could indeed write code, though it wasn't
       | pretty and didn't have the nice things like mocks and tests. This
       | has changed a little since 2010 at least but it is still dire,
       | even with Cloud making things much easier.
       | 
       | *EDIT:* I've gone from +4 to 0 points in a very, very rapid
       | amount of time. If I have offended you; how?
        
         | jupp0r wrote:
         | I don't think that's how the SRE role is described in the book.
         | It shifts the boundaries between sysadmins and developers to
         | give developers much more freedom in deciding the operational
         | parameters of how their software is run/released but also gives
         | SREs the ability to push back if they get handed over some
         | piece of software that's not sufficiently following recommended
         | patterns that make actually operating it mostly automated.
        
           | dijit wrote:
           | That's exactly right.
           | 
           | Can you please point to the part of my comment that you think
           | disagrees with this?
        
             | jupp0r wrote:
             | I think I might have misclicked and responded to a
             | different comment than I intended to. You were basically
             | saying the same thing :)
        
         | arccy wrote:
         | i would think most people object to calling SREs sysadmins.
        
           | dijit wrote:
           | that's unfortunate, I think that means they aren't aware that
           | the job of SRE is functionally identical to a sysadmin from
           | 2005 in terms of responsibilities and required knowledge.
           | 
           | We just live on a higher level of abstraction and have better
           | tools & processes now.
        
             | rvnx wrote:
             | Compared to 2005, we live in a society where a lot of
             | people are very sensitive to words.
             | 
             | For example, "developer" is considered offensive, because
             | for some people it's very important to be called "software
             | engineer".
             | 
             | Really good developers don't care about titles.
             | 
             | They don't have time to worry about such, or they have so
             | much money / experience that even if you call them "smart
             | monkeys" they'll be happy with it.
             | 
             | Same goes with sysadmins, SREs, devops, or whatever role
             | you choose.
             | 
             | Some people have shitty jobs: they don't have such
             | recognition (whether for a good reason or not).
             | 
             | No recognition from work, no recognition from colleagues,
             | no recognition financially, etc., so if you take away their
             | title / prestige, obviously they would feel bad.
             | 
             | Source: my experience in a school calling itself
             | "engineering school", and all other schools calling it a
             | "place where to pee code"
        
               | dataflow wrote:
               | > Really good developers don't care about titles. They
               | don't have time to worry about such, or they have so much
               | money / experience that even if you call them "smart
               | monkeys" they'll be happy with it.
               | 
               | That's about the money, not about being good at your
               | work. Ask anyone on the street if you can call them a rat
               | in exchange for a million dollar salary and they'll say
               | yes. It's quite simple.
        
             | achiang wrote:
             | I was a sysadmin (at uni, in the early 2000s) and I am an
             | SRE today (at Google).
             | 
             | The two jobs are nothing alike, at all, whatsoever.
             | 
             | Sysadmins are support roles. Their functional role is to
             | provide a healthy substrate to run the application layer on
             | top of.
             | 
             | SREs work at the application layer itself. If the system
             | can't scale due to internal architecture, an SRE would be
             | expected to propose a new, scalable design. That would be
             | in _addition_ to maintaining the substrate.
             | 
             | To be clear, there is also nothing inferior about
             | performing a support role. No org can succeed without
             | support.
             | 
             | But the two roles are not the same, and if a job's set of
             | responsibilities doesn't include shared ownership over
             | application layer architecture, then it can be a great job
             | but it's not an SRE role.
        
           | dilyevsky wrote:
           | there's a sort of bifurcation of SRE responsibilities with
           | one branch focusing on software-enabled automation and the
           | other on "systems engineering" aka sysadmin. Both are called
           | SREs at Google which seems to cause widespread confusion
           | externally (and even within Google). Also see so called "Ben
           | Treynor Curve"[0]
           | 
           | [0] - https://www.usenix.org/system/files/login/articles/login_jun...
        
             | lrem wrote:
             | Within Google: Treynor's curve is a hiring concept. Once
             | in, you're doing literally the same job. Being in a team
             | doing greenfield development it took me three years to
             | notice that my TL, with whom we've been defining the
             | mathematical model, designing and implementing the system,
             | is on the SysEng ladder.
        
           | throwaway5752 wrote:
           | This might be an age-related perception. I think if you're
           | over 40, you'd consider this complementary to SREs. The role
           | of sysadmin, as it existed in 2000, is almost unimaginable
           | now.
        
           | ozr wrote:
           | As someone that has had those titles for many years in the
           | past: I'm only going to object to the one that's paying less
           | at the time.
           | 
           | A good sysadmin was always doing the same thing as a good
           | SRE.
        
         | ranger207 wrote:
         | Sysadmin originally meant Unix greybeards who were competent at
         | C and could write and implement, say, Kerberos. Then in the
         | 90s-2000s the term came to primarily refer to Windows admins
         | clicking around Active Directory and Group Policy for large
         | enterprises. The "sysadmins can't code" thing came from that
         | time and the early 2010s when all the cool startups were
         | building on Linux and the available pool of sysadmins were
         | largely Windows specialists. Then DevOps came along trying to
         | get Windows sysadmins that were dabbling in Linux into modern
         | development practices, and SRE came along trying to revitalise
         | the old Unix greybeard style with modern software development
         | practices.
        
         | oooyay wrote:
         | That may be what sysadmin meant back in the 80s and 90s but it's
         | not what it means now. I would never describe myself, an SRE-SE,
         | as a sysadmin because I would be describing someone whose
         | primary job is to operate software and iterate on
         | configuration.
         | 
         | SRE-SEs on the other hand are SWEs, they just have a focus in
         | systems adjacent software whereas an SRE-SWE is someone who can
         | dig into compiler level issues and optimization. Both write
         | application code, do analysis, and write policy. A sysadmin of
         | today would be out of place on a team like that.
        
         | alexey-salmin wrote:
         | * SRE are sysadmins with Go
         | * DevOps are sysadmins with Python
         | * Sysadmins are sysadmins with Perl
         | 
         | (stolen from twitter, can't find the source)
        
           | jimbokun wrote:
           | I'm a software engineer that also does operations work. But I
           | really love Go for operations type tasks. Simple, direct,
           | code that's not too verbose, with predictable performance,
           | and fast compilation to a single executable. I've even taken
           | to baking an editor and the compiler into an image with a
           | small library of code for operational tasks, that I can
           | quickly edit and run in the deployed environment for ad hoc
           | operational tasks.
        
             | oblio wrote:
             | Image?
        
           | paulddraper wrote:
           | What if they have PowerShell?
        
       | ChrisArchitect wrote:
       | Related bunch of discussion from a few months ago:
       | 
       |  _Lessons Learned from Twenty Years of Site Reliability
       | Engineering_
       | 
       | https://news.ycombinator.com/item?id=38037141
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _Lessons Learned from Twenty Years of Site Reliability
         | Engineering_ - https://news.ycombinator.com/item?id=38037141 -
         | Oct 2023 (124 comments)
         | 
         |  _Google Online SRE Books_ -
         | https://news.ycombinator.com/item?id=31373170 - May 2022 (11
         | comments)
         | 
         |  _What Is 'Site Reliability Engineering'?_ -
         | https://news.ycombinator.com/item?id=14153545 - April 2017 (86
         | comments)
         | 
         |  _Site Reliability Engineering_ -
         | https://news.ycombinator.com/item?id=13503161 - Jan 2017 (111
         | comments)
         | 
         |  _Notes on Google's Site Reliability Engineering Book_ -
         | https://news.ycombinator.com/item?id=11474002 - April 2016 (93
         | comments)
        
       | poisonta wrote:
       | I'm curious whether the success of Google in launching software
       | that seems not fully developed can be attributed to their Site
       | Reliability Engineering (SRE) practices.
        
         | bananapub wrote:
         | this is a dumb comment, but yes, part of the role of SREs was
         | helping people make (and then implement) trade-offs around
         | system deployment while deploying things that basically worked
         | as intended.
        
           | eichin wrote:
           | As I understand it (from friends who were SREs in the 2010s)
           | the really clever bit was that projects basically had a
           | _budget_ for "how much SRE attention your deployment needed"
           | - so there was payoff for getting more deployment details
           | right the first time, and structural pushback for just
           | throwing things over the wall. Sounded like an interesting
           | way to connect up the levers...
        
       | ChrisArchitect wrote:
       | In looking at the Book Updates section
       | (https://sre.google/resources/book-update/) there's a bunch of
       | companion articles and resources but has there been any actual
       | updates to the book since 2017?
        
         | Rapzid wrote:
         | The other books.
        
       | yashness wrote:
       | Google is anyway planning to shut down the SRE role &
       | transition them to the SWE role predominantly. A few months back
       | there was an announcement already & one of the reflections is to
       | start with reducing the numbers - https://archive.ph/YWp4O
        
         | danpalmer wrote:
         | This doesn't say SRE is shutting down, it says that they're
         | changing the ratio of SRE to SWE. One thing to realise about
         | Google is that the technology is increasingly unified across
         | the company. 10 years ago everything worked in different ways,
         | but now there are very standard technologies and paths, and
         | naturally this requires fewer SREs to the SWEs developing the
         | products. I don't think this is a bad thing, and in the layoffs
         | SREs have not to my knowledge been hit any harder than SWEs.
        
         | rwiggins wrote:
         | It has indeed been a strange time for Google SRE recently.
         | However, they're definitely not planning on shutting down SRE -
         | at least, if you can trust Google leadership's actual
         | explanation of what that meant.
         | 
         | Supposedly, the ratio of SRE to product eng had been growing
         | slowly over the years. The KR to "readjust" that ratio was to
         | bring it back in line with historical norms, i.e., to ensure
         | that SRE continued to scale sub-linearly with SWE/systems. This
         | had (primarily) two facets.
         | 
         | First, it gave SRE teams an effectively-blank check to
         | reevaluate their existing dev engagements and jettison the ones
         | that weren't working well.
         | 
         | Second, it pushed to eliminate old tools/systems/platforms and
         | converge onto the more modern stuff, like Annealing [1]. Fewer
         | crufty platforms means fewer teams needed to run them, and
         | improvements in those platforms have broad impact.
         | 
         | Anecdotally, my own sub-org (within SRE) is _growing_ at the
         | moment. Not by a huge amount, but growing nonetheless.
         | 
         | [1]: https://www.usenix.org/publications/loginonline/prodspec-and...
        
         | Unfrozen0688 wrote:
         | Ofc everyone downsizing... smh
        
       | yashness wrote:
       | Google is anyway reducing the SRE task force & is planning to
       | completely eliminate the SRE role. There has been a recent
       | announcement already & they have already started the move -
       | https://archive.ph/YWp4O
        
         | hedora wrote:
         | The archive link is busted for me, but that sounds like a bad
         | move. 90+% of SWEs are bad at and hate SRE work, and vice
         | versa.
         | 
         | The rare ones that can do a mediocre job at both (and that
         | won't burn out and switch jobs if told to do both) are usually
         | not capable of doing an excellent job at either.
         | 
         | Using analogies from pretty much any other field shows how dumb
         | it is to combine SRE and SWE, or fuse DevOps (or, god forbid,
         | DevSecOps) into one role:
         | 
         | - Would you have a surgeon drive an ambulance?
         | 
         | - An expert car mechanic manage fleet scheduling and logistics?
         | 
         | - Tell a salesman to design marketing graphics, and have your
         | graphic designer manage high-value customer accounts?
        
       | habitue wrote:
       | The SRE book is a little more advertisement of Google's internal
       | systems than real actionable advice outside of Google for SREs.
       | 
       | There is some generally useful stuff in there, but it probably
       | fits in a few pages vs a full book.
        
         | lrem wrote:
         | The second book is actually actionable, according to coauthors.
        
           | kyrra wrote:
           | By second book, do you mean the SWE book?
           | https://abseil.io/resources/swe-book
           | 
           | It was written by titans on the SWE ladder at Google,
           | fairly disconnected from the SRE book.
        
             | arccy wrote:
             | no, the sre workbook
             | https://sre.google/workbook/table-of-contents/
        
         | lulznews wrote:
         | It's really an incredible marketing piece
        
         | pquki4 wrote:
         | Only a few paragraphs into the book:
         | 
         | > One continual challenge Google faces is hiring SREs: not only
         | does SRE compete for the same candidates as the product
         | development hiring pipeline, but the fact that we set the
         | hiring bar so high in terms of both coding and system
         | engineering skills means that our hiring pool is necessarily
         | small.
         | 
         | I was thinking, ok so does this mean the book is completely
         | useless for most companies in the world, since they don't have
         | such standards for hiring people or run DevOps this way? How
         | much of the rest of the book is still applicable?
        
       ___________________________________________________________________
       (page generated 2024-03-03 23:00 UTC)