[HN Gopher] Seeing Like an SRE: Site Reliability Engineering as ...
       ___________________________________________________________________
        
       Seeing Like an SRE: Site Reliability Engineering as High Modernism
        
       Author : zdw
       Score  : 142 points
       Date   : 2021-05-04 14:26 UTC (8 hours ago)
        
 (HTM) web link (www.usenix.org)
 (TXT) w3m dump (www.usenix.org)
        
       | pm90 wrote:
       | This is an extremely well written article. The concepts of techne
       | and metis, I hope these become part of tech vocabulary and allow
       | us to talk about differences in perspectives on infrastructure
       | and especially infrastructure migrations more effectively without
       | hating each other.
        
         | anotha1 wrote:
         | > Techne is universal knowledge: things like the boiling point
         | of water, Pythagoras' theorem, the rule that all RPCs should
         | have deadlines, or that we should probably alert if no
         | instances of our jobs are running.
         | 
         | > metis, is local, specific, and practical. It's won from
         | experience. It can't be codified in the same way that techne
         | can. The comparison that Scott gives is between navigation and
         | piloting. Deepwater navigation is a general skill, but a pilot
         | knows a specific port -- a 'local and situated knowledge,' as
         | Scott puts it, including tides, currents, seasonal changes,
         | shifting sandbars, and wind patterns. A pilot cannot move to
         | another port and expect to have the same level of skill and
         | local knowledge.
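
The two techne-style rules quoted above (every RPC gets a deadline; alert when no instances of a job are running) are easy to state in code. Below is a minimal Python sketch using only the standard library. The names `call_with_deadline`, `slow_backend`, and `should_alert` are hypothetical, and a thread pool stands in for a real RPC framework, which would normally carry the deadline in the call itself.

```python
import concurrent.futures
import time


def call_with_deadline(fn, deadline_s, *args):
    """Run fn(*args), raising TimeoutError if it takes longer than deadline_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"deadline of {deadline_s}s exceeded") from None
    finally:
        # Do not wait for the abandoned worker; it finishes in the background.
        pool.shutdown(wait=False)


def slow_backend():
    """Hypothetical stand-in for a remote call that answers too slowly."""
    time.sleep(0.3)
    return "late"


def should_alert(running_instances):
    """The second rule: page someone when a job has zero running instances."""
    return running_instances == 0
```

Calling `call_with_deadline(slow_backend, 0.05)` raises `TimeoutError` instead of hanging. Choosing the right deadline for a particular service is the metis part, and the sketch has nothing to say about that.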
        
           | rwtwe wrote:
           | It might be worth noting that we don't need to rely on this
           | particular book as a source for this distinction. It is
           | essentially congruent with the necessary/contingent
           | distinction in philosophy.
           | 
           | Other expressions of it include the strategy/tactics
           | distinction and the nomothetic/idiographic distinction. The
           | idea is based on the very ancient observation that phenomena
           | involve both general laws and specific circumstances.
        
       | r0s wrote:
       | This relates directly to automated testing. Unit test coverage is
       | important, but equally important are functional tests from the
       | perspective of a user executing real workflows.
       | 
       | The full picture of app behavior is invaluable to the new or
       | learning engineer, or even experienced engineers learning some
       | unfamiliar subsystem.
        
       | benlivengood wrote:
       | Something the author didn't touch on specifically is the limit on
       | languages at Google. When I left the officially supported
       | languages were Java, C++, Python, and Go. That limited the scope
       | of CI/CD, tracing and monitoring, and debugging to something
       | tractable for the developer tools teams. It also made it
       | tractable for SRE teams to be able to engage with new product
       | teams without having to learn a whole new language.
       | 
       | A really useful thing my team did (and I think it was a
       | moderately successful trend on other SRE teams) was to role play
       | recent outages. The oncall who had seen a particularly
       | interesting outage would DM using the graphs, error messages, and
       | logs they encountered when debugging an alert for a chosen victim
       | (ahem, role-player) who would have to choose which graphs,
       | dashboards, and logs to look at and which remediation actions to
       | take to track down and fix the actual problem. It was perfect for
       | building _metis_ since it was done in a team setting so everyone
       | benefited from the insights into the system architecture and
       | behavior and the role-player learned practical oncall skills.
       | Things like escalating to other teams and running incident
       | management were built into the RP.
        
         | klodolph wrote:
         | Python is "supported" but if you want to write a new program in
         | Python you need approval from the area tech lead ¯\_(ツ)_/¯
        
         | pm90 wrote:
         | This is such a great idea. I struggle to see it being adopted
         | at my current, OKR driven organization where literally any work
         | is debated until death lol.
        
       | eternalban wrote:
       | Poor Corbusier, getting blamed for the architectural errors of
       | Mies van der Rohe's sadly untalented copycats, pseudo-
       | intellectual ideologues, and greedy developers.
       | 
       | For the record, Corbusier's _Ville Radieuse_ (Radiant City)
       | predates the Cold War by a rather hot World War II (1930).
       | Interestingly enough, it was a very Googly impulse -- "organize
       | all the world's" bipeds -- that motivated the relatively young
       | control freak aka architect. After WWII, Corbu mellowed. And his
       | collective residential structures, _Unité d'habitation_, were
       | the result of his synthesis of a generative _measuring system_
       | and _modularity_. OP and fellow SREs have quite a lot to learn
       | from the mature thoughts of Le Corbusier.
       | 
       | Over here in America, we had our own native genius, Frank Lloyd
       | Wright, who devised his vision of an urbanism for a democracy -
       | The Broadacre City:
       | 
       | https://franklloydwright.org/revisiting-frank-lloyd-wrights-...
       | 
       | But of course, the "high modernism" clique (run by the moneyed
       | set of the East Coast (think MoMA), and the "ex-Nazi", Philip
       | Johnson) did everything to marginalize Wright. And it was
       | this clique, having imported wholesale (ironically) the leftist
       | architects of Europe escaping Fascism, that gifted us with "high
       | modernism" dystopia.
       | 
       | If you want to learn about modern architecture, I recommend Ken
       | Frampton's _Modern Architecture: A Critical History_. He was one
       | of the very few actual teachers I had in architectural school
       | worthy of the designation.
       | 
       | https://en.wikipedia.org/wiki/Kenneth_Frampton
       | 
       | https://www.goodreads.com/book/show/70140.Modern_Architectur...
       | 
       | https://en.wikipedia.org/wiki/Philip_Johnson#Controversy_ove...
       | 
       | https://en.wikipedia.org/wiki/Ludwig_Mies_van_der_Rohe (His own
       | works were exquisite gems.)
        
       | Simon321 wrote:
       | Great article. Nice insights on techne & metis.
        
       | throwaway823882 wrote:
       | I agree with this post 200%.
       | 
       | > I recently spent some time trying to write a set of general
       | guidelines for what to monitor in a software system
       | 
       | Reframe as "Shit That Needs To Run To Make The Customer Happy"
       | and you get closer to what you want. Which is to say, it's
       | completely product-specific. A general list of technical things
       | to monitor is about as useful as monitoring the cotton thread
       | fiber integrity of a pair of shoelaces. Is it the cotton thread
       | fiber integrity that you care about, or the general quality of the
       | shoelaces? Are they shitty laces, or just decent, or great laces?
       | Quantify that.
       | 
       | > Typically, the former kind of PRR will take a quarter or more,
       | because invariably, new large services have a significant amount
       | of tasks to do to get them production-ready. The SRE team
       | onboarding the service will spend significant time finding gaps,
       | understanding what happens when dependencies of the service fail,
       | improving monitoring and runbooks and automation.
       | 
       | I deal with these a lot of the time, and I hate them because they
       | are so stupid. We could make these reviews completely self-
       | service and automated and they'd move a lot faster, and could
       | even be on-going as the product is actually released to
       | customers. But SRE and Architecture remain their own silos, and
       | neither of them works closely enough with the product team or core
       | engineering groups to find the streamlined, agile ways of doing
       | these things. Basically, none of them grok the concept of finding
       | quicker and better ways to get this shit done. Or they just don't
       | care to.
       | 
       | > The second kind of PRR typically does not uncover much to be
       | done, and devolves into a tick-box exercise where the developers
       | attempt to demonstrate compliance with the organisation's
       | production standards.
       | 
       | Architecture and SRE don't explain to the product team WTF they
       | are going on about, so of course they just tick boxes mindlessly.
       | Nobody wants to stop and understand the whole picture, so you end
       | up with empty formalism.
       | 
       | The way to "formalize" and "standardize" the operationalizing of
       | a product is to make it clear _what the fuck is going on_ at each
       | stage of your product. Who the fuck are my customers? What the
       | fuck are they doing with the product? How the fuck does the
       | product work for them, and internally? What the fuck are the
       | external dependencies and how do _they_ work? You need simple,
       | practical ways to express these things.
       | 
       | And you also need to train people as to why _everyone_ needs to
       | understand these things. Why you cannot just allow someone to sit
       | in their little corner of a room and jerk off and collect a
       | paycheck. I often hear it from developers ("I just want to write
       | code") but literally everyone else in the organization does it
       | too.
        
       | jart wrote:
       | The author makes Kubernetes sound like it's a technocratic regime
       | controlled by a political class of anyone who's ever held the
       | title SRE at Google. They do control the means of production. Me
       | however, I'm just a member of the typing pool.
        
         | lacker wrote:
         | Perhaps everyone who was ever an SRE at Google added one new
         | configuration option to Kubernetes, and that's how it ended up
         | this way.
        
           | logicslave wrote:
           | You joke but that's what happened with Tensorflow at Google.
           | Everyone wanted a "contributed to tensorflow" on their resume.
        
             | jart wrote:
             | Well I think what they wanted was for their work to be
             | used. It was a great big bag of things.
        
       | ravi-delia wrote:
       | My (admittedly limited) experience is that systems aren't
       | maintainable except by people that are very familiar with them.
       | The basic principles of the SRE don't ignore that, they embrace
       | it. Rather than trying to manage a system from the top, they
       | encourage the admin to delve in and craft it themselves. By
       | bringing infrastructure close to the users of that
       | infrastructure, everyone gets a chance to gain hands on
       | knowledge. Is that how it actually turns out? Maybe, maybe not.
        
         | SideburnsOfDoom wrote:
         | > systems aren't maintainable except by people that are very
         | familiar with them.
         | 
         | I think that a consequence of "two sorts of knowledge: techne
         | and metis" is that standardisation is good, but it only gets
         | you so far. Past that point, you need to be familiar with the
         | system.
         | 
         | This should not devalue our efforts to standardise, e.g. get
         | systems to all log to the same aggregator, and emit the same
         | basic stats, agree on naming and forwarding of correlation ids
         | that will allow us to cross-reference related log entries.
         | 
         | But we should also recognise that those efforts will never
         | cover everything.
         | 
         | e.g. If I changed over to working on an unfamiliar system in
         | the same organisation, I would know where it should be logging
         | to, what the field naming and general structure of those log
         | entries should be, but I would not know what healthy
         | operation should look like in those logs.
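
The standardisation described above (the same log structure everywhere, plus a correlation id that is forwarded between services) can be sketched with Python's standard `logging` module. The field names, the `checkout` service name, and the helpers `make_logger` and `handle_request` are assumptions made for illustration, not anything taken from the thread.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Emit every record with the same agreed-upon field names."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            # Set via the `extra` argument at the call site; None if absent.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })


def make_logger(service, stream):
    """Build a logger that writes standardised JSON lines to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    return logger


def handle_request(logger, incoming_id=None):
    """Reuse the caller's correlation id if present, otherwise mint one."""
    correlation_id = incoming_id or str(uuid.uuid4())
    logger.info("request received", extra={"correlation_id": correlation_id})
    return correlation_id
```

A shared formatter like this answers "where is it logging and what are the fields called". It does not say which messages are normal for a given service, which is the gap the comment points at.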
        
       | theptip wrote:
       | It's an interesting comparison. Looking back in the history of
       | software, "A pattern language" was an architectural treatise
       | which inspired the software concept of "software design
       | patterns".
       | 
       | Similarly, I can see that considering the known issues with top-
       | down vs. bottom-up city planning/evolution could be beneficial
       | for software-centric organizations too; the issues with badly-fit
       | top-down city plans seem to match very well with the pains of an
       | ill-fit software architecture that's mandated from an ivory
       | tower, complete with users using the planned cities "wrong".
       | 
       | I'm sure there are differences though. You have a lot more
       | observability into your software systems, and at the end of the
       | day, they are orders of magnitude less complicated than cities,
       | so you can comprehend more of the system at once, and truly find
       | common usecases to standardize around. This is in contrast to
       | cities where it's impossible to really know every citizen's
       | unique needs, temperament, and usage patterns.
       | 
       | Worth thinking about more; given the relatively low cross-
       | pollination rates between the fields, I suspect there are more
       | lessons that software engineers could glean from architecture and
       | city planning.
        
         | ssivark wrote:
         | A key underlying assumption in Scott's perspective in "Seeing
         | like a State" is that diversity is critically important to
         | healthy functioning of biological/human/cultural ecosystems. In
         | large computing system fleets we're often okay with the
         | opposite -- simplifying by fiat because the
         | understanding/control of the architect is more important than
         | the diversity of individual machine configurations. Yes, the
         | monoculture could lead to correlated failures (Eg: all machines
         | are vulnerable to the same exploit), but the common perspective
         | is that the simplicity/controllability and efficiency gains are
         | worth it.
         | 
         | I think we might be able to get by with this perspective so
         | long as we're seeing computers/systems only as inert tools.
         | It's interesting to consider whether there's any motivation for
         | that to change, as we move towards more ubiquitous &
         | intelligent computing. (Eg: should IOT devices be thought akin
         | to insects?)
        
           | WJW wrote:
           | One of the key differences is that (the various components
           | of) nature has no common goal except that each individual
           | component wants to reproduce, while large computing systems
           | are almost always constructed to achieve some particular
           | objective. Thus, nature is OK with it if predators randomly
           | kill some percent of the population while most factories
           | would frown very much if a random employee started sabotaging
           | lathes or something.
           | 
           | You could argue that netflix-style chaos engineering is an
           | attempt to introduce more resilience into the system
           | precisely by mimicking nature's "anything can die at any
           | moment" principle, but even then it typically only applies to
           | computers. Netflix is known for firing fast but I don't think
           | even they would consider randomly firing employees to make
           | sure there are no single points of failure in the employee
             | makeup. Would be interesting though: tax filings need to be
           | submitted next Tuesday but the CFO was just fired, what is
           | your recovery plan?
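
The "anything can die at any moment" idea can be illustrated with a toy simulation: remove one randomly chosen instance and check whether enough capacity remains. This is only a sketch under assumed names (`kill_random_instance`, `survives_single_failure`, instance labels like `web-1`). It does not reflect how Netflix's actual tooling works.

```python
import random


def kill_random_instance(instances, rng):
    """Pick one instance at random and return (victim, surviving instances)."""
    # Sort first so the choice is reproducible for a seeded rng.
    victim = rng.choice(sorted(instances))
    return victim, instances - {victim}


def survives_single_failure(instances, min_healthy, rng):
    """True if losing any one random instance still leaves min_healthy running."""
    _, remaining = kill_random_instance(instances, rng)
    return len(remaining) >= min_healthy


if __name__ == "__main__":
    fleet = {"web-1", "web-2", "web-3"}
    rng = random.Random()
    victim, remaining = kill_random_instance(fleet, rng)
    print(f"terminated {victim}; still running: {sorted(remaining)}")
```

A fleet of one fails the check for any `min_healthy` of at least 1, which is the single point of failure the comment jokes about with the CFO.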
        
             | Kalium wrote:
             | I've encountered the idea of a Chaos HR Simian. People get
             | random, unplanned, multi-week vacations.
             | 
             | Mentioned here:
             | https://www.cognitect.com/blog/2016/3/24/the-new-normal-
             | embr...
             | 
             | I know I've been on teams that were significantly disrupted
             | by jury duty, medical incidents, traffic accidents, etc. So
             | it seems like a reasonable way to simulate this.
        
       | Jiocus wrote:
       | The author touches on _knowledge management_, which is one of
       | the most interesting subjects I was able to study at uni (part of
       | CS/SS). A kind of analogy to the techne/metis is the concept of
       | _explicit_ and _implicit_ knowledge.
       | 
       | We codify knowledge or information into _explicit knowledge_ such
       | as documentation, expert systems or design. Not all knowledge
       | lends itself to this.
       | 
       | _Implicit knowledge_ is that which often requires experience and
       | learning by doing. It is hard to capture explicitly. On the one
       | hand because the skilled individual might be unaware of the skill
       | in action, on the other they may be unable to express it.
       | 
       | Various hacks are then tried to pry this valuable asset out into
       | the open, so it can be recorded on a corporate wiki.
        
       | deeblering4 wrote:
       | It's baffling to see a majority of the tech industry adopt the
       | job title that a giant advertising company invented as some sort
       | of panacea, because they wrote a book.
       | 
       | SRE has some good ideas in principle. But in practice, unless you
       | are Google, it often leads to over-engineering.
        
         | 0xbadcafebee wrote:
         | It's an industry specialization with no formal training. You
         | can get a degree in computer science. You can't get a degree in
         | running large-scale computer systems. Plumbers have better
         | training than we do. Garbage men have better training than we
         | do.
         | 
         | If any organization is bold enough to write a book on it, that
         | book becomes the de-facto standard. (it helps that it's 10000%
         | easier than buying and reading a million disparate ISO
         | standards on Information Technology)
        
       ___________________________________________________________________
       (page generated 2021-05-04 23:01 UTC)