[HN Gopher] Google's SRE Book (2017)
___________________________________________________________________
Google's SRE Book (2017)
Author : udev4096
Score : 139 points
Date : 2024-03-03 12:08 UTC (10 hours ago)
(HTM) web link (sre.google)
(TXT) w3m dump (sre.google)
| overbytecode wrote:
| Can any SREs tell us how applicable this book is today? Is it
| still a useful read?
| oooyay wrote:
| Yeah, it is, but there's also a lot more to being an SRE than
| this book. This book more or less tells you how to stand up a
| reliability program; what it doesn't really indicate is _what
| SREs do_. A lot of people I meet think SRE is just the new
| title for "operator", which couldn't be further from the truth.
| Whether you're doing an embedded model, like is referenced in
| the book, or you have a central org - both are made up of
| software and systems software engineers that are focused on
| performance and reliability. They build software, do analysis,
| and write policy that improve the bottom-line reliability of
| the organization.
| pm90 wrote:
| Not an SRE, but I think the main contribution from this book
| was to popularize terminology of operations (eg SLAs) and to
| give an opinionated perspective on how to handle operations at
| scale.
|
| More practically, I don't think the book is as useful, as it
| generally only makes sense when you reach a certain scale that
| few organizations ever do (imo).
|
| However, we are heading into a future where computing will be
| everywhere and sensors in everything, so in maybe a decade even
| the "smallest" of organizations may be responsible for large
| scale distributed systems, and operating those would require
| the concepts that are provided in the book.
| arter4 wrote:
| As a non-Googler myself, it still is if you want to know how to
| set up an SRE team and introduce SRE (ie good sysadmin, for
| lack of a better word) best practices. The focus on actual
| indicators such as SLIs and SLOs, the importance of reducing
| "toil" (boring repetitive tasks) and automating... these are
| all valid concerns.
|
| If you want more about system design and how to design
| reliability, I suggest reading
| https://google.github.io/building-secure-and-reliable-system...
| bananapub wrote:
| yes, but not as a checklist of things you have to do, instead
| it's a valuable discussion of lots of problems and how they
| were solved in specific circumstances.
|
| learn from it, don't copy from it.
| Rapzid wrote:
| The front half is for introducing ideas. The back chapters
| were never that great IMHO. They are both too in the weeds and
| at the same time missing actionable advice.
| jupp0r wrote:
| Read this, even if you are far away from any actual operation of
| the systems you work on. Read it especially then.
|
| Learning the principles and philosophy conveyed in that book
| helped me tremendously in my career (as a software engineer).
| Thanks people at Google for writing and open sourcing it.
| hrpnk wrote:
| The important bit to remember when reading is to understand the
| origin (why) of the concepts. I've seen Engineers being too
| dogmatic about the book, saying that "Google does it this way"
| and not being able to apply the concepts to one's own
| organizational context. Even at Google, there will be different
| teams who will deviate from the "described process" given their
| business context, setup, or stakeholders.
| jeffbee wrote:
| > Even at Google, there will be different teams who will
| deviate from the "described process" given their business
| context, setup, or stakeholders.
|
| Seriously. A lot of the book was influenced by Social SRE who
| had opinions all out of proportion to their own importance
| and success. At the time, there was some doubt about whether
| Social's pet theories belonged in the book, considering the
| varying practices and beliefs of other SRE groups supporting
| products that people actually use.
|
| This is related to my rule that anybody can title their doc
| "Best Practices" even if nobody subscribes to them.
| esafak wrote:
| Can you name a better book?
| jeffbee wrote:
| None off the top of my head but for the engineering
| topics in Part III I think it pays to read them as
| historical background material, then read the last decade
| of conference papers, talks, and articles.
|
| Example: the section on backend subsetting in distributed
| systems is not current. If you want the current Google
| practice you need to read "Reinventing Backend Subsetting
| at Google"[1], and there are other interesting
| publications from other organizations.
|
| 1: https://queue.acm.org/detail.cfm?id=3570937
| brandall10 wrote:
| Can you please expound on what Social SRE is? Is it some
| public handle of an engineer, a name for an internal group,
| or something else?
| bananapub wrote:
| the SRE teams that worked around various Google "Social"
| [media] products
| rozenmd wrote:
| There's still good advice in the book, but be aware it was
| published in 2016, with folks likely having started writing it
| around 2014.
|
| Both Google and SRE/DevOps have advanced greatly since then, and
| following the book blindly would be cargo culting.
|
| Edit: apparently this is a controversial opinion?
| esafak wrote:
| What contemporary book do you recommend?
| Rapzid wrote:
| How has it advanced greatly since then?
| lrem wrote:
| Most of the tools I was using when my colleagues were
| writing the book are either gone, or half-forgotten
| abandonware. The new tools were built for different
| processes, system layouts and organisational structures.
| Rapzid wrote:
| Book barely talks about tools. It wasn't about tools. The
| epiphany for many was the concept of an error budget and
| establishing SLOs. Then, basing investment in reliability
| on data.
|
| That's as applicable today as it was then.
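
For readers new to the term, the error budget idea mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: a request-based SLO over a fixed window, with hypothetical function names (`error_budget`, `budget_remaining`) and made-up numbers. It is not code from the book or from Google.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Failed requests the SLO tolerates in the window (hypothetical helper).

    With a 99.9% SLO, 0.1% of requests may fail before the SLO is missed.
    """
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget at all, so nothing can remain.
        return 0.0
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Illustrative numbers only: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

The "investment based on data" part then comes down to a policy choice: for example, a team might agree to pause feature launches and work on reliability once `budget_remaining` drops to zero or below.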
| softirq wrote:
| Most companies completely missed the point of SRE/PE/DevOps and
| keep them on separate teams doing sysadmin toil work and oncall
| thrown over the wall by engineers who are only concerned with
| feature deadlines. They regress them back to sysadmin duties and
| get none of the value of a true SRE program.
|
| SRE should always be a subtitle for a SWE and not a separate
| position, and they should always be embedded with SWEs into one
| team either building products or infrastructure. The shared
| ownership and toil reduction _only_ works if you have these two
| things.
|
| All this said, I think the regression is also due to the fact
| that real SREs are rare. A solid SWE that also has deep systems
| domain knowledge, understands how to sift through dashboards
| and live data, and can root-cause complex performance problems
| is a master of many domains and is hard to find.
| snowfield wrote:
| The regression is also due to the fact that a real SRE is
| expensive. It's cheaper to just get some new grads to react to
| alarms following a set runbook of what to do if that alarm
| triggers.
|
| VERY few companies operate at Google's scale. For 99.99% of
| companies it makes sense to investigate single machine issues.
| bananapub wrote:
| Google SREs also end up investigating single machine issues,
| fyi.
| jeffbee wrote:
| Yes, but At Scale(r)
|
| It's a totally different experience when you have the
| people who technically own the hardware side of the
| operations taking no responsibility for the well-being of
| it, and the people who own the software developing
| elaborate workarounds for bad machines, and the SREs
| maintaining blacklists of individual nodes.
| amelius wrote:
| Does it have a chapter on how to deal with end-user support?
| michaelt wrote:
| For more information on Google's end-user support, please post
| on the Community Forum.
| snowfield wrote:
| I think Google's advice is: don't
| dijit wrote:
| What I found most fascinating is that Google essentially
| rediscovered what is _important_ in a sysadmin and codified the
| contract between feature developers and reliability roles.
|
| Instead of having feature developers feeling like they have no
| say in operational requirements, and instead of having
| reliability staff fighting unstable mess: properly making the
| contract means everyone gets heard.
|
| Contrast that with devops, which, despite coming out later, was
| in vogue when this book came out, and which muddied the role of
| sysadmin into meaning either:
|
| * a sysadmin practicing agile (the original definition btw)
|
| * a software engineer with enough OS skills to carry the pager
| (the popular one), or
|
| * a team consisting of sysadmins _and_ software developers with
| no barrier between them (10+ deploys per day style).
|
| Everyone had their own definition of DevOps. So when SRE came
| along and said clearly: sysadmins are needed, stop trying to push
| everything into one person, here's how we fix the tension between
| teams; it was a breath of fresh air.
|
| The only revisionist history (that even Google seems to forget)
| is that sysadmins could indeed write code, though it wasn't pretty
| and didn't have the nice things like mocks and tests. This has
| changed a little since 2010 at least but it is still dire, even
| with Cloud making things much easier.
|
| *EDIT:* I've gone from +4 to 0 points in a very, very short
| amount of time. If I have offended you, how?
| jupp0r wrote:
| I don't think that's how the SRE role is described in the book.
| It shifts the boundaries between sysadmins and developers to
| give developers much more freedoms in deciding the operational
| parameters of how their software is run/released but also gives
| SREs the ability to push back if they get handed over some
| piece of software that's not sufficiently following recommended
| patterns that make actually operating it mostly automated.
| dijit wrote:
| That's exactly right.
|
| Can you please point to the part of my comment that you think
| disagrees with this?
| jupp0r wrote:
| I think I might have misclicked and responded to a
| different comment than I intended to. You were basically
| saying the same thing :)
| arccy wrote:
| i would think most people object to calling SREs sysadmins.
| dijit wrote:
| that's unfortunate, I think that means they aren't aware that
| the job of SRE is functionally identical to a sysadmin from
| 2005 in terms of responsibilities and required knowledge.
|
| We just live on a higher level of abstraction and have better
| tools & processes now.
| rvnx wrote:
| Compared to 2005, we live in a society where a lot of
| people are very sensitive to words.
|
| For example, "developer" is considered offensive, because
| for some people it's very important to be called "software
| engineer".
|
| Really good developers don't care about titles.
|
| They don't have time to worry about such, or they have so
| much money / experience that even if you call them "smart
| monkeys" they'll be happy with it.
|
| Same goes with sysadmins, SREs, devops, or whatever role
| you choose.
|
| Some people have shitty jobs: they don't have such
| recognition (whether for a good reason or not).
|
| No recognition from work, no recognition from colleagues,
| no recognition financially, etc., so if you take away their
| title / prestige, obviously they would feel bad.
|
| Source: my experience in a school calling itself
| "engineering school", and all other schools calling it a
| "place where to pee code"
| dataflow wrote:
| > Really good developers don't care about titles. They
| don't have time to worry about such, or they have so much
| money / experience that even if you call them "smart
| monkeys" they'll be happy with it.
|
| That's about the money, not about being good at your
| work. Ask anyone on the street if you can call them a rat
| in exchange for a million dollar salary and they'll say
| yes. It's quite simple.
| achiang wrote:
| I was a sysadmin (at uni, in the early 2000s) and I am an
| SRE today (at Google).
|
| The two jobs are nothing alike, at all, whatsoever.
|
| Sysadmins are support roles. Their functional role is to
| provide a healthy substrate to run the application layer on
| top of.
|
| SREs work at the application layer itself. If the system
| can't scale due to internal architecture, an SRE would be
| expected to propose a new, scalable design. That would be
| in _addition_ to maintaining the substrate.
|
| To be clear, there is also nothing inferior about
| performing a support role. No org can succeed without
| support.
|
| But the two roles are not the same, and if a job's set of
| responsibilities doesn't include shared ownership over
| application layer architecture, then it can be a great job
| but it's not an SRE role.
| dilyevsky wrote:
| there's a sort of bifurcation of SRE responsibilities with
| one branch focusing on software-enabled automation and the
| other on "systems engineering" aka sysadmin. Both are called
| SREs at Google which seems to cause widespread confusion
| externally (and even within Google). Also see the so-called
| "Ben Treynor Curve"[0]
|
| [0] - https://www.usenix.org/system/files/login/articles/login_jun...
| lrem wrote:
| Within Google: Treynor's curve is a hiring concept. Once
| in, you're doing literally the same job. Being in a team
| doing greenfield development it took me three years to
| notice that my TL, with whom we've been defining the
| mathematical model, designing and implementing the system,
| is on the SysEng ladder.
| throwaway5752 wrote:
| This might be an age-related perception. I think if you're
| over 40, you'd consider this complementary to SREs. The role
| of sysadmin, as it existed in 2000, is almost unimaginable
| now.
| ozr wrote:
| As someone that has had those titles for many years in the
| past: I'm only going to object to the one that's paying less
| at the time.
|
| A good sysadmin was always doing the same thing as a good
| SRE.
| ranger207 wrote:
| Sysadmin originally meant Unix greybeards who were competent at
| C and could write and implement, say, Kerberos. Then in the
| 90s-2000s the term came to primarily refer to Windows admins
| clicking around Active Directory and Group Policy for large
| enterprises. The "sysadmins can't code" thing came from that
| time and the early 2010s when all the cool startups were
| building on Linux and the available pool of sysadmins were
| largely Windows specialists. Then DevOps came along trying to
| get Windows sysadmins that were dabbling in Linux into modern
| development practices, and SRE came along trying to revitalise
| the old Unix greybeard style with modern software development
| practices.
| oooyay wrote:
| That may be what sysadmin meant back in the 80s and 90s but it's
| not what it means now. I would never describe myself, an SRE-SE,
| as a sysadmin because I would be describing someone whose
| primary job is to operate software and iterate on
| configuration.
|
| SRE-SEs on the other hand are SWEs, they just have a focus in
| systems-adjacent software whereas an SRE-SWE is someone who can
| dig into compiler level issues and optimization. Both write
| application code, do analysis, and write policy. A sysadmin of
| today would be out of place on a team like that.
| alexey-salmin wrote:
| * SRE are sysadmins with Go
|
| * DevOps are sysadmins with Python
|
| * Sysadmins are sysadmins with Perl
|
| (stolen from twitter, can't find the source)
| jimbokun wrote:
| I'm a software engineer that also does operations work. But I
| really love Go for operations type tasks. Simple, direct,
| code that's not too verbose, with predictable performance,
| and fast compilation to a single executable. I've even taken
| to baking an editor and the compiler into an image with a
| small library of code for operational tasks, that I can
| quickly edit and run in the deployed environment for ad hoc
| operational tasks.
| oblio wrote:
| Image?
| paulddraper wrote:
| What if they have PowerShell?
| ChrisArchitect wrote:
| Related bunch of discussion from a few months ago:
|
| _Lessons Learned from Twenty Years of Site Reliability
| Engineering_
|
| https://news.ycombinator.com/item?id=38037141
| dang wrote:
| Thanks! Macroexpanded:
|
| _Lessons Learned from Twenty Years of Site Reliability
| Engineering_ - https://news.ycombinator.com/item?id=38037141 -
| Oct 2023 (124 comments)
|
| _Google Online SRE Books_ -
| https://news.ycombinator.com/item?id=31373170 - May 2022 (11
| comments)
|
| _What Is 'Site Reliability Engineering'?_ -
| https://news.ycombinator.com/item?id=14153545 - April 2017 (86
| comments)
|
| _Site Reliability Engineering_ -
| https://news.ycombinator.com/item?id=13503161 - Jan 2017 (111
| comments)
|
| _Notes on Google's Site Reliability Engineering Book_ -
| https://news.ycombinator.com/item?id=11474002 - April 2016 (93
| comments)
| poisonta wrote:
| I'm curious whether the success of Google in launching software
| that seems not fully developed can be attributed to their Site
| Reliability Engineering (SRE) practices.
| bananapub wrote:
| this is a dumb comment, but yes, part of the role of SREs was
| helping people make (and then implement) trade-offs around
| system deployment while deploying things that basically worked
| as intended.
| eichin wrote:
| As I understand it (from friends who were SREs in the 2010s)
| the really clever bit was that projects basically had a
| _budget_ for "how much SRE attention your deployment needed"
| - so there was payoff for getting more deployment details
| right the first time, and structural pushback for just
| throwing things over the wall. Sounded like an interesting
| way to connect up the levers...
| ChrisArchitect wrote:
| In looking at the Book Updates section
| (https://sre.google/resources/book-update/) there's a bunch of
| companion articles and resources but have there been any actual
| updates to the book since 2017?
| Rapzid wrote:
| The other books.
| yashness wrote:
| Google is planning to shut down the SRE role anyway &
| transition them predominantly to the SWE role. An announcement
| was already made a few months back, & one sign of this is that
| they are starting by reducing the numbers -
| https://archive.ph/YWp4O
| danpalmer wrote:
| This doesn't say SRE is shutting down, it says that they're
| changing the ratio of SRE to SWE. One thing to realise about
| Google is that the technology is increasingly unified across
| the company. 10 years ago everything worked in different ways,
| but now there are very standard technologies and paths, and
| naturally this requires fewer SREs to the SWEs developing the
| products. I don't think this is a bad thing, and in the layoffs
| SREs have not to my knowledge been hit any harder than SWEs.
| rwiggins wrote:
| It has indeed been a strange time for Google SRE recently.
| However, they're definitely not planning on shutting down SRE -
| at least, if you can trust Google leadership's actual
| explanation of what that meant.
|
| Supposedly, the ratio of SRE to product eng had been growing
| slowly over the years. The KR to "readjust" that ratio was to
| bring it back in line with historical norms, i.e., to ensure
| that SRE continued to scale sub-linearly with SWE/systems. This
| had (primarily) two facets.
|
| First, it gave SRE teams an effectively-blank check to
| reevaluate their existing dev engagements and jettison the ones
| that weren't working well.
|
| Second, it pushed to eliminate old tools/systems/platforms and
| converge onto the more modern stuff, like Annealing [1]. Fewer
| crufty platforms means fewer teams needed to run them, and
| improvements in those platforms have broad impact.
|
| Anecdotally, my own sub-org (within SRE) is _growing_ at the
| moment. Not by a huge amount, but growing nonetheless.
|
| [1]: https://www.usenix.org/publications/loginonline/prodspec-and...
| Unfrozen0688 wrote:
| Ofc everyone downsizing... smh
| yashness wrote:
| Google is reducing the SRE task force anyway & is planning to
| completely eliminate the SRE role. There has been a recent
| announcement already & they have already started the move -
| https://archive.ph/YWp4O
| hedora wrote:
| The archive link is busted for me, but that sounds like a bad
| move. 90+% of SWEs are bad at and hate SRE work, and vice
| versa.
|
| The rare ones that can do a mediocre job at both (and that
| won't burn out and switch jobs if told to do both) are usually
| not capable of doing an excellent job at either.
|
| Using analogies from pretty much any other field shows how dumb
| it is to combine SRE and SWE, or fuse DevOps (or, god forbid
| DevSecOps) into one role:
|
| - Would you have a surgeon drive an ambulance?
|
| - An expert car mechanic manage fleet scheduling and logistics?
|
| - Tell a salesman to design marketing graphics, and have your
| graphic designer manage high-value customer accounts?
| habitue wrote:
| The SRE book is a little more an advertisement for Google's
| internal systems than real actionable advice for SREs outside
| of Google.
|
| There is some generally useful stuff in there, but it probably
| fits in a few pages vs a full book.
| lrem wrote:
| The second book is actually actionable, according to coauthors.
| kyrra wrote:
| By second book, do you mean the SWE book?
| https://abseil.io/resources/swe-book
|
| It was written by titans on the SWE ladder at Google,
| fairly disconnected from the SRE book.
| arccy wrote:
| no, the sre workbook
| https://sre.google/workbook/table-of-contents/
| lulznews wrote:
| It's really an incredible marketing piece
| pquki4 wrote:
| Only a few paragraphs into the book:
|
| > One continual challenge Google faces is hiring SREs: not only
| does SRE compete for the same candidates as the product
| development hiring pipeline, but the fact that we set the
| hiring bar so high in terms of both coding and system
| engineering skills means that our hiring pool is necessarily
| small.
|
| I was thinking, ok so does this mean the book is completely
| useless for most companies in the world, since they don't have
| such standards for hiring people or run DevOps this way? How
| much of the rest of the book is still applicable?
___________________________________________________________________
(page generated 2024-03-03 23:00 UTC)