[HN Gopher] Seeing Like an SRE: Site Reliability Engineering as ...
___________________________________________________________________
Seeing Like an SRE: Site Reliability Engineering as High Modernism
Author : zdw
Score : 142 points
Date : 2021-05-04 14:26 UTC (8 hours ago)
(HTM) web link (www.usenix.org)
(TXT) w3m dump (www.usenix.org)
| pm90 wrote:
| This is an extremely well written article. The concepts of techne
| and metis, I hope these become part of tech vocabulary and allow
| us to talk about differences in perspectives on infrastructure
| and especially infrastructure migrations more effectively without
| hating each other.
| anotha1 wrote:
| > Techne is universal knowledge: things like the boiling point
| of water, Pythagoras' theorem, the rule that all RPCs should
| have deadlines, or that we should probably alert if no
| instances of our jobs are running.
|
| > metis, is local, specific, and practical. It's won from
| experience. It can't be codified in the same way that techne
| can. The comparison that Scott gives is between navigation and
| piloting. Deepwater navigation is a general skill, but a pilot
| knows a specific port -- a 'local and situated knowledge,' as
| Scott puts it, including tides, currents, seasonal changes,
| shifting sandbars, and wind patterns. A pilot cannot move to
| another port and expect to have the same level of skill and
| local knowledge.
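|
| To make the two techne rules quoted above concrete, here is a minimal
| Python sketch. The names call_with_deadline, should_alert, and
| slow_backend are hypothetical, and the thread pool is only a stand-in
| for whatever deadline support a real RPC library provides.

```python
import concurrent.futures
import time


def call_with_deadline(fn, deadline_s, *args):
    """Run fn(*args); raise TimeoutError if it exceeds deadline_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"call exceeded {deadline_s}s deadline") from None
    finally:
        # Do not block on the stuck worker. This sketch only stops waiting;
        # it does not cancel the work already in flight.
        pool.shutdown(wait=False)


def should_alert(running_instances):
    """Alert when a job has no running instances at all."""
    return running_instances == 0


def slow_backend():
    # Hypothetical backend that answers too late for a short deadline.
    time.sleep(0.2)
    return "late reply"
```

| Both rules fit in a few lines because they are universal. Deciding
| what deadline suits a particular service is the part that needs local
| knowledge.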
| rwtwe wrote:
| It might be worth noting that we don't need to rely on this
| particular book as a source for this distinction. It is
| essentially congruent with the necessary/contingent
| distinction in philosophy.
|
| Other expressions of it include the strategy/tactics
| distinction and the nomothetic/idiographic distinction. The
| idea is based on the very ancient observation that phenomena
| involve both general laws and specific circumstances.
| r0s wrote:
| This relates directly to automated testing. Unit test coverage is
| important, but equally important are functional tests from the
| perspective of a user executing real workflows.
|
| The full picture of app behavior is invaluable to the new or
| learning engineer, or even experienced engineers learning some
| unfamiliar subsystem.
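|
| As a rough illustration of the difference, here is a sketch with one
| unit test and one functional test. The Cart class and checkout
| function are hypothetical and exist only for this example.

```python
class Cart:
    """Hypothetical shopping cart, used only for illustration."""

    def __init__(self):
        self.items = {}  # name -> (price_cents, qty)

    def add(self, name, price_cents, qty=1):
        if qty <= 0:
            raise ValueError("qty must be positive")
        old_qty = self.items.get(name, (price_cents, 0))[1]
        self.items[name] = (price_cents, old_qty + qty)

    def total_cents(self):
        return sum(price * qty for price, qty in self.items.values())


def checkout(cart, payment_cents):
    """Return the change owed, or raise if the payment is too small."""
    total = cart.total_cents()
    if payment_cents < total:
        raise ValueError("insufficient payment")
    return payment_cents - total


def test_unit_total():
    # Unit test: one method, checked in isolation.
    cart = Cart()
    cart.add("tea", 300, qty=2)
    assert cart.total_cents() == 600


def test_functional_purchase():
    # Functional test: a whole user workflow, from first item to change due.
    cart = Cart()
    cart.add("tea", 300)
    cart.add("tea", 300)
    cart.add("mug", 950)
    assert checkout(cart, 2000) == 450
```

| The functional test doubles as documentation: a new engineer can read
| it to see what a real purchase looks like from the user's side.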
| benlivengood wrote:
| Something the author didn't touch on specifically is the limit on
| languages at Google. When I left the officially supported
| languages were Java, C++, Python, and Go. That limited the scope
| of CI/CD, tracing and monitoring, and debugging to something
| tractable for the developer tools teams. It also made it
| tractable for SRE teams to be able to engage with new product
| teams without having to learn a whole new language.
|
| A really useful thing my team did (and I think it was a
| moderately successful trend on other SRE teams) was to role play
| recent outages. The oncall who had seen a particularly
| interesting outage would DM using the graphs, error messages, and
| logs they encountered when debugging an alert for a chosen victim
| (ahem, role-player) who would have to choose which graphs,
| dashboards, and logs to look at and which remediation actions to
| take to track down and fix the actual problem. It was perfect for
| building _metis_ since it was done in a team setting so everyone
| benefited from the insights into the system architecture and
| behavior and the role-player learned practical oncall skills.
| Things like escalating to other teams and running incident
| management were built into the RP.
| klodolph wrote:
| Python is "supported" but if you want to write a new program in
| Python you need approval from the area tech lead ¯\_(ツ)_/¯
| pm90 wrote:
| This is such a great idea. I struggle to see it being adopted
| at my current, OKR driven organization where literally any work
| is debated until death lol.
| eternalban wrote:
| Poor Corbusier, getting blamed for the architectural errors of
| Mies van der Rohe's sadly untalented copy cats, pseudo-
| intellectual ideologues, and greedy developers.
|
| For the record, Corbusier's _Ville Radieuse_ (Radiant City)
| predates the Cold War by a rather hot World War II (1930).
| Interestingly enough, it was a very Googly impulse -- "organize
| all the world's" bipeds -- that motivated the relatively young
| control freak aka architect. After WWII, Corbu mellowed. And his
| collective residential structures, _Unité d'habitation_, were
| the result of his synthesis of a generative _measuring system_
| and _modularity_. OP and fellow SREs have quite a lot to learn
| from the mature thoughts of Le Corbusier.
|
| Over here in America, we had our own native genius, Frank Lloyd
| Wright, who devised his vision of an urbanism for a democracy -
| The Broadacre City:
|
| https://franklloydwright.org/revisiting-frank-lloyd-wrights-...
|
| But of course, the "high modernism" clique (run by the moneyed
| set of the East Coast (think MoMA), and the "ex-Nazi", Philip
| Johnson) did everything to marginalize Wright. And it was
| this clique, having imported wholesale (ironically) the leftist
| architects of Europe escaping Fascism, that gifted us with "high
| modernism" dystopia.
|
| If you want to learn about modern architecture, I recommend Ken
| Frampton's _Modern Architecture: A Critical History_. He was one
| of the very few actual teachers I had in architectural school
| worthy of the designation.
|
| https://en.wikipedia.org/wiki/Kenneth_Frampton
|
| https://www.goodreads.com/book/show/70140.Modern_Architectur...
|
| https://en.wikipedia.org/wiki/Philip_Johnson#Controversy_ove...
|
| https://en.wikipedia.org/wiki/Ludwig_Mies_van_der_Rohe (His own
| works were exquisite gems.)
| Simon321 wrote:
| Great article. Nice insights on techne & metis.
| throwaway823882 wrote:
| I agree with this post 200%.
|
| > I recently spent some time trying to write a set of general
| guidelines for what to monitor in a software system
|
| Reframe as "Shit That Needs To Run To Make The Customer Happy"
| and you get closer to what you want. Which is to say, it's
| completely product-specific. A general list of technical things
| to monitor is about as useful as monitoring the cotton thread
| fiber integrity of a pair of shoelaces. Is it the cotton thread
| fiber integrity that you care about, or the general quality of the
| shoelaces? Are they shitty laces, or just decent, or great laces?
| Quantify that.
|
| > Typically, the former kind of PRR will take a quarter or more,
| because invariably, new large services have a significant amount
| of tasks to do to get them production-ready. The SRE team
| onboarding the service will spend significant time finding gaps,
| understanding what happens when dependencies of the service fail,
| improving monitoring and runbooks and automation.
|
| I deal with these a lot of the time, and I hate them because they
| are so stupid. We could make these reviews completely self-
| service and automated and they'd move a lot faster, and could
| even be on-going as the product is actually released to
| customers. But SRE and Architecture remain their own silos, and
| neither of them work closely enough with the product team or core
| engineering groups to find the streamlined, agile ways of doing
| these things. Basically, none of them grok the concept of finding
| quicker and better ways to get this shit done. Or they just don't
| care to.
|
| > The second kind of PRR typically does not uncover much to be
| done, and devolves into a tick-box exercise where the developers
| attempt to demonstrate compliance with the organisation's
| production standards.
|
| Architecture and SRE don't explain to the product team WTF they
| are going on about, so of course they just tick boxes mindlessly.
| Nobody wants to stop and understand the whole picture, so you end
| up with empty formalism.
|
| The way to "formalize" and "standardize" the operationalizing of
| a product is to make it clear _what the fuck is going on_ at each
| stage of your product. Who the fuck are my customers? What the
| fuck are they doing with the product? How the fuck does the
| product work for them, and internally? What the fuck are the
| external dependencies and how do _they_ work? You need simple,
| practical ways to express these things.
|
| And you also need to train people as to why _everyone_ needs to
| understand these things. Why you cannot just allow someone to sit
| in their little corner of a room and jerk off and collect a
| paycheck. I often hear it from developers ("I just want to write
| code") but literally everyone else in the organization does it
| too.
| jart wrote:
| The author makes Kubernetes sound like it's a technocratic regime
| controlled by a political class of anyone who's ever held the
| title SRE at Google. They do control the means of production. Me
| however, I'm just a member of the typing pool.
| lacker wrote:
| Perhaps everyone who was ever an SRE at Google added one new
| configuration option to Kubernetes, and that's how it ended up
| this way.
| logicslave wrote:
| You joke but that's what happened with Tensorflow at Google.
| Everyone wanted a "contributed to tensorflow" on their resume
| jart wrote:
| Well I think what they wanted was for their work to be
| used. It was a great big bag of things.
| ravi-delia wrote:
| My (admittedly limited) experience is that systems aren't
| maintainable except by people that are very familiar with them.
| The basic principles of the SRE don't ignore that, they embrace
| it. Rather than trying to manage a system from the top, they
| encourage the admin to delve in and craft it themselves. By
| bringing infrastructure close to the users of that
| infrastructure, everyone gets a chance to gain hands on
| knowledge. Is that how it actually turns out? Maybe, maybe not.
| SideburnsOfDoom wrote:
| > systems aren't maintainable except by people that are very
| familiar with them.
|
| I think that a consequence of "two sorts of knowledge: techne
| and metis" is that standardisation is good, but it only gets
| you so far. Past that point, you need to be familiar with the
| system.
|
| This should not devalue our efforts to standardise, e.g. get
| systems to all log to the same aggregator, and emit the same
| basic stats, agree on naming and forwarding of correlation ids
| that will allow us to cross-reference related log entries.
|
| But we should also recognise that those efforts will never
| cover everything.
|
| e.g. If I changed over to working on an unfamiliar system in
| the same organisation, I would know where it should be logging
| to, what the field naming and general structure of those log
| entries should be, but I would not know what healthy
| operation should look like in those logs.
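|
| A minimal sketch of that kind of standardisation, using Python's
| standard logging module. The field names (service, level, message,
| correlation_id) and the helpers make_logger and handle_request are
| assumptions made for the example, not an agreed convention; the
| stream stands in for a shared aggregator.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with agreed field names."""

    def format(self, record):
        return json.dumps({
            "service": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
            # Set via logger.info(..., extra={"correlation_id": ...}).
            "correlation_id": getattr(record, "correlation_id", None),
        })


def make_logger(service, stream):
    """Build a logger for one service that writes JSON lines to stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def handle_request(frontend, backend, correlation_id=None):
    # Reuse the caller's id if present, otherwise mint one at the edge.
    cid = correlation_id or str(uuid.uuid4())
    frontend.info("request received", extra={"correlation_id": cid})
    backend.info("query executed", extra={"correlation_id": cid})
    return cid


if __name__ == "__main__":
    aggregator = io.StringIO()  # stand-in for the shared log aggregator
    fe = make_logger("frontend", aggregator)
    be = make_logger("backend", aggregator)
    handle_request(fe, be)
    print(aggregator.getvalue(), end="")
```

| Every entry for one request can then be found by filtering on
| correlation_id. Whether those entries look healthy is still something
| only familiarity with the system can tell you.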
| theptip wrote:
| It's an interesting comparison. Looking back in the history of
| software, "A pattern language" was an architectural treatise
| which inspired the software concept of "software design
| patterns".
|
| Similarly, I can see that considering the known issues with top-
| down vs. bottom-up city planning/evolution could be beneficial
| for software-centric organizations too; the issues with badly-fit
| top-down city plans seem to match very well with the pains of an
| ill-fit software architecture that's mandated from an ivory
| tower, complete with users using the planned cities "wrong".
|
| I'm sure there are differences though. You have a lot more
| observability into your software systems, and at the end of the
| day, they are orders of magnitude less complicated than cities,
| so you can comprehend more of the system at once, and truly find
| common usecases to standardize around. This is in contrast to
| cities where it's impossible to really know every citizen's
| unique needs, temperament, and usage patterns.
|
| Worth thinking about more; given the relatively low cross-
| pollination rates between the fields, I suspect there are more
| lessons that software engineers could glean from architecture and
| city planning.
| ssivark wrote:
| A key underlying assumption in Scott's perspective in "Seeing
| like a State" is that diversity is critically important to
| healthy functioning of biological/human/cultural ecosystems. In
| large computing system fleets we're often okay with the
| opposite -- simplifying by fiat because the
| understanding/control of the architect is more important than
| the diversity of individual machine configurations. Yes, the
| monoculture could lead to correlated failures (Eg: all machines
| are vulnerable to the same exploit), but the common perspective
| is that the simplicity/controllability and efficiency gains are
| worth it.
|
| I think we might be able to get by with this perspective so
| long as we're seeing computers/systems only as inert tools.
| It's interesting to consider whether there's any motivation for
| that to change, as we move towards more ubiquitous &
| intelligent computing. (Eg: should IOT devices be thought akin
| to insects?)
| WJW wrote:
| One of the key differences is that (the various components
| of) nature has no common goal except that each individual
| component wants to reproduce, while large computing systems
| are almost always constructed to achieve some particular
| objective. Thus, nature is OK with it if predators randomly
| kill some percent of the population while most factories
| would frown very much if a random employee started sabotaging
| lathes or something..
|
| You could argue that netflix-style chaos engineering is an
| attempt to introduce more resilience into the system
| precisely by mimicking nature's "anything can die at any
| moment" principle, but even then it typically only applies to
| computers. Netflix is known for firing fast but I don't think
| even they would consider randomly firing employees to make
| sure there are no single points of failure in the employee
| makeup. Would be interesting though: tax filings need to be
| submitted next Tuesday but the CFO was just fired, what is
| your recovery plan?
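|
| The "anything can die at any moment" idea can be sketched as a toy
| simulation. This is not how Netflix's tooling works; the fleet dict
| and both function names are hypothetical.

```python
import random


def kill_random_instance(fleet, rng):
    """Remove one randomly chosen instance; return (service, instance).

    Raises IndexError if the fleet has no instances left.
    """
    candidates = [(svc, inst) for svc, insts in fleet.items() for inst in insts]
    service, instance = rng.choice(candidates)
    fleet[service].remove(instance)
    return service, instance


def single_points_of_failure(fleet):
    """Services with at most one instance left, sorted by name."""
    return sorted(svc for svc, insts in fleet.items() if len(insts) <= 1)


if __name__ == "__main__":
    fleet = {
        "api": ["api-1", "api-2", "api-3"],
        "billing": ["billing-1"],  # the lone CFO of this fleet
    }
    print("at risk before:", single_points_of_failure(fleet))
    print("killed:", kill_random_instance(fleet, random.Random()))
    print("at risk after:", single_points_of_failure(fleet))
```

| Running the kill step regularly is what forces the recovery-plan
| question to be answered before a real failure asks it.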
| Kalium wrote:
| I've encountered the idea of a Chaos HR Simian. People get
| random, unplanned, multi-week vacations.
|
| Mentioned here:
| https://www.cognitect.com/blog/2016/3/24/the-new-normal-embr...
|
| I know I've been on teams that were significantly disrupted
| by jury duty, medical incidents, traffic accidents, etc. So
| it seems like a reasonable way to simulate this.
| Jiocus wrote:
| The author touches on _knowledge management_ , which is one of
| the most interesting subjects I was able to study at uni (part of
| CS/SS). A kind of analogy to the techne/metis is the concept of
| _explicit_ and _implicit_ knowledge.
|
| We codify knowledge or information into _explicit knowledge_ such
| as documentation, expert systems or design. Not all knowledge
| lends itself to this.
|
| _Implicit knowledge_ is that which often requires experience and
| learning by doing. It is hard to capture explicitly. On the one
| hand because the skilled individual might be unaware of the skill
| in action, on the other they may be unable to express it.
|
| Various hacks are then tried to pry this valuable asset out into
| the open, so it can be recorded on a corporate wiki.
| deeblering4 wrote:
| It's baffling to see a majority of the tech industry adopt the
| job title that a giant advertising company invented as some sort
| of panacea, because they wrote a book.
|
| SRE has some good ideas in principle. But in practice, unless you
| are Google, it often leads to over-engineering.
| 0xbadcafebee wrote:
| It's an industry specialization with no formal training. You
| can get a degree in computer science. You can't get a degree in
| running large-scale computer systems. Plumbers have better
| training than we do. Garbage men have better training than we
| do.
|
| If any organization is bold enough to write a book on it, that
| book becomes the de facto standard. (it helps that it's 10000%
| easier than buying and reading a million disparate ISO
| standards on Information Technology)
___________________________________________________________________
(page generated 2021-05-04 23:01 UTC)