[HN Gopher] A few ops lessons we all learn the hard way (2020)
       ___________________________________________________________________
        
       A few ops lessons we all learn the hard way (2020)
        
       Author : Tomte
       Score  : 129 points
       Date   : 2022-08-22 12:55 UTC (10 hours ago)
        
 (HTM) web link (www.netmeister.org)
 (TXT) w3m dump (www.netmeister.org)
        
       | encoderer wrote:
       | Absence of a signal is itself a signal.
       | 
       | I've built a business around this one!
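
A minimal sketch of the "absence of a signal" idea, written as a dead man's switch: each job pings on success, and anything that stays quiet past its allowed interval gets flagged. The class name `HeartbeatMonitor`, its methods, and the job names are hypothetical and not taken from any particular product.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Flags jobs that have been silent longer than their allowed interval."""

    def __init__(self) -> None:
        self._max_silence: Dict[str, float] = {}  # job -> allowed seconds between pings
        self._last_seen: Dict[str, float] = {}    # job -> timestamp of the latest ping

    def register(self, job: str, max_silence_s: float, now: Optional[float] = None) -> None:
        # Registration counts as the first ping, so a new job is not flagged at once.
        self._max_silence[job] = max_silence_s
        self._last_seen[job] = time.time() if now is None else now

    def ping(self, job: str, now: Optional[float] = None) -> None:
        # Called by the job itself each time it finishes successfully.
        if job not in self._max_silence:
            raise KeyError(f"unknown job: {job}")
        self._last_seen[job] = time.time() if now is None else now

    def silent_jobs(self, now: Optional[float] = None) -> List[str]:
        # The alert condition is the missing ping, not an error message.
        current = time.time() if now is None else now
        return sorted(
            job
            for job, limit in self._max_silence.items()
            if current - self._last_seen[job] > limit
        )
```

The `now` parameter exists only so the logic can be exercised without waiting on a real clock; in practice `silent_jobs()` would be polled on a schedule by something that is itself monitored.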
        
       | bravetraveler wrote:
       | Quite interesting but I don't really understand the framing with
       | self-signed certs
       | 
       | It can (and often does) go poorly, but the only thing you really
       | gain from an external CA is a [minor] reduction in responsibility
        
         | number6 wrote:
         | OK, let's spin up the CA and create a cert for this service.
         | Validity 15 years.
         | 
         | <15 years later>
         | 
         | Why did this thingy break? How do I get a new cert? CA was on a
         | Virtual Machine? The old server? What is ESXi? Ok we will just
         | spin up this docker container, create a new cert and make it
         | last 20 years...
        
           | bombcar wrote:
           | This is the main reason Let's Encrypt gives you three month
           | certs - so you set up a way of automating it.
           | 
           | It's not too hard but I wish it could be even easier. The
           | best I've found is a main system that gets the wildcard cert
           | and then transfers that around where needed.
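
As a rough sketch of the check an automated renewal job might start with: given a certificate's `notAfter` string (the format that `ssl.SSLSocket.getpeercert()` reports), work out how many days remain and whether it is time to renew. The 30-day threshold and the function names are assumptions for illustration, not something the thread specifies.

```python
import ssl
import time
from typing import Optional

# Assumed policy: renew once 30 days or fewer remain on the certificate.
RENEW_WHEN_DAYS_LEFT = 30.0


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a cert expires.

    not_after uses the "Jan  5 09:34:43 2018 GMT" style that the ssl module
    reports in the 'notAfter' field of a peer certificate.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    current = time.time() if now is None else now
    return (expires_at - current) / 86400.0


def needs_renewal(
    not_after: str,
    now: Optional[float] = None,
    threshold_days: float = RENEW_WHEN_DAYS_LEFT,
) -> bool:
    # An already expired cert yields a negative value and so also needs renewal.
    return days_until_expiry(not_after, now) <= threshold_days
```

Running a check like this on a schedule, and alerting when it keeps returning `True`, catches the case where the renewal automation itself has quietly stopped working.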
        
           | bravetraveler wrote:
           | I'm not doubting it can be done poorly
           | 
           | I'm just curious why an org would do that when they have a
           | domain controller/login infrastructure providing a CA, or one
           | of the many secret-storing-engines backing the vast majority
           | of their sensitive data (eg: Vault)?
           | 
           | Policies can be defined and enforced to serve as guard rails
           | -- we choose the nightmares we accept
        
             | zaphar wrote:
             | Ahhh, yes I see where you got lost. You are assuming that
             | they either:
             | 
             | * Have a team willing to commit to managing CAs as part of
             | the DC/Login infrastructure.
             | 
             | or
             | 
             | * Are actually using something like Vault in their
             | infrastructure.
             | 
             | Common mistake, really.
             | 
             | In all seriousness you would be amazed how many places
             | neither of those things are true for, and additionally how
             | much sheer effort it would take to make them true at those
             | orgs.
        
               | 0xbadcafebee wrote:
               | If you have a team that manages a domain, it will take a
               | month to get anything done with them, if they even allow
               | you to get what you need done. If there is a way to make
               | it easier, they aren't interested.
        
               | bravetraveler wrote:
               | Instead we get to deal with finance for two months to
               | order from a public CA!
               | 
               | Before anyone says it, yes we know about LetsEncrypt, and
               | no we cannot use it
        
               | bravetraveler wrote:
               | Indeed, I am a bit spoiled... I neglected smaller shops
               | where 'not enough people' rings even more true.
               | 
               | Where I'm at, we have tons and tons of people, enough to
               | dedicate (and form) teams for things. It admittedly skews
               | my view quite a bit. Here... we don't have enough
               | _capable_ people.
               | 
               | I'm also a bit sour at the alternatives. We have certain
               | (publicly visible) certificates that must go through a
               | CA. That's an absolutely painful process that requires
               | about six levels of finance approval -- every year!
        
       | pid-1 wrote:
       | > The source you're looking at is not the code running in
       | production.
       | 
       | My first boss's favorite words were: check check check check check.
       | That's also the first thing I teach to new engineers: most of
       | your assumptions are completely wrong; double-check everything.
        
         | d23 wrote:
         | Most debugging is simply a series of validations of which of
         | your assumptions is incorrect.
        
       | teddyh wrote:
       | As per #53, #37 could probably use a link to
       | https://xkcd.com/1597/
        
       | belfalas wrote:
       | _> Your network team has a way into the network that your
       | security team doesn't know about._
       | 
       | This one is true, and even if you right now go and try to make it
       | not true for your company...it will become true again later.
        
         | WFHRenaissance wrote:
         | This, along with the obvious intersection of skillsets and
         | interests, is why I often now see either network ops and
         | security bundled together on the same team/org, or DevOps
         | people doing the work of both teams.
        
           | dsr_ wrote:
           | DevNetSecOps is much more sensible than most of the other
           | ways. The audit team needs to be independent, of course.
        
       | olddustytrail wrote:
       | Many that are absolutely true here. But also one that's flat out
       | wrong:
       | 
       | 37. Nobody knows how git works; everybody simply rm -fr && git
       | checkout's periodically.
       | 
       | I know how it works. Because the first time I ever heard of git
       | was when I had started a new job and was told the dev team were
       | switching to it. So I spent a couple of days reading up on it and
       | learned exactly how it works.
       | 
       | If you work in Ops, I suggest you do your job properly and do
       | likewise.
        
         | kerblang wrote:
         | I learned how git works but without my cheat sheet and scripts
         | I am a helpless angry baby. (guess where I keep them? git)
         | 
         | Still I happily award the Cryptic Weirdo Savant of the Year
         | Trophy to anyone who can convincingly lie about memorizing that
         | gibberish.
        
           | olddustytrail wrote:
           | You clearly didn't learn how it works since you're helpless
           | without a cheat sheet. Sounds like you memorized some stuff.
           | 
           | I didn't memorize it, I learned how it worked. These are very
           | different things.
        
             | kerblang wrote:
             | That was mildly convincing but it's been a fairly
             | competitive year
        
               | olddustytrail wrote:
               | It's not because I'm super smart, you're just super lazy.
        
               | AnimalMuppet wrote:
               | Personal attacks are against the site rules here.
        
               | lolc wrote:
               | Cute.
               | 
               | I typically use an array of Git commands per hour of
               | working. I consult the manual or a how-to maybe once
               | every other week. Re-cloning a repo happens less than
               | once a year for me. Can't remember the last time I had to
               | do that.
               | 
               | And it's not just me. I see the same for the people I
               | work with.
               | 
               | Of course, for many people, learning Git is not worth the
               | effort. But that doesn't mean people who handle it
               | fluently don't exist.
        
         | icedchai wrote:
         | A lot of junior people have no idea how to use git. No doubt,
         | it _is_ confusing especially if you've never used source
         | control before. I've seen some seriously screwed up git
         | "flows". I've seen people who have no idea what a merge
         | conflict is, or how to resolve one, so they wind up committing
         | the conflicts.
         | 
         | It wasn't much different in the subversion days, or before
         | (CVS, anyone?)
        
           | olddustytrail wrote:
           | I know, and what I'm saying is that it's not confusing if you
           | learn how it works. If you just try to figure it out as you
           | go along you'll end up with a mental model which is vastly
           | more complicated than git itself is!
        
       | crazylifetwist wrote:
       | Cache:
       | https://web.archive.org/web/20220817133026/https://www.netme...
        
       | gonzo41 wrote:
       | No, 11. Took the last place I worked about 5 years to rollout a
       | pretty good solution to a massive legacy set of servers. Though I
       | gotta say, the super privileged automated certificate renewer
       | thingy does seem like a real honey pot.
        
       | mcqueenjordan wrote:
       | Really good article. Some of these are subtle, and really must be
       | learned the hard way. The only one I found myself thinking I
       | disagreed with was "85. Multithreading is rarely worth the added
       | complexity." Maybe I simply have yet to learn it the hard way,
       | but of all the ways to add complexity, I have tended to find
       | multithreading as one of the more legitimate. That being said, it
       | has to be done in a simple, easy to reason about way. Usually for
       | me, this means fork-joining homogeneous tasks.
        
         | eschneider wrote:
         | A. However well you understand multithreading, you only need
         | one coworker who doesn't understand multithreading to make your
         | life an unending hell. B. You always have at least one coworker
         | who doesn't completely understand multithreading. :/
        
           | cpurdy wrote:
           | multithreading to A. understand coworker who multithreading,
           | you only need one multithreading. :/ doesn't coworker who
           | However understand well you life an unending hell. B. You
           | always have at least one doesn't make your completely
           | understand
        
         | twh270 wrote:
         | Yes. If you're going to do multi-threading, let the
         | framework/language handle the hard parts[1].
         | 
         | [1] It's _all_ hard parts.
        
         | naasking wrote:
         | > Usually for me, this means fork-joining homogeneous tasks
         | 
         | I think the article doesn't make the right distinction:
         | parallelism is often worth it, concurrency is what causes the
         | headaches. Fork-join is parallelism and generally safe and
         | relatively easy.
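
A small sketch of the fork-join shape described above: the same function is applied to independent inputs, nothing is shared between tasks, and results are collected only after every task has finished. `fork_join` and `word_count` are hypothetical names used for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def fork_join(task: Callable[[T], R], inputs: Iterable[T], workers: int = 4) -> List[R]:
    """Run the same task over independent inputs, then join all results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Fork: one task per input. map() yields results in input order.
        results = list(pool.map(task, inputs))
    # Join: leaving the with-block waits for all workers to finish.
    return results


def word_count(text: str) -> int:
    # A pure function: no shared state, so no locks are needed.
    return len(text.split())


if __name__ == "__main__":
    print(fork_join(word_count, ["a b c", "d e", ""]))
```

Because each task only reads its own input and returns a value, there is no shared mutable state to reason about. For CPU-bound work in CPython, swapping in `ProcessPoolExecutor` is the usual adjustment; threads suit I/O-bound tasks.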
        
       | andreareina wrote:
       | TAI > UTC. Except for the fact that no-one uses TAI.
        
         | marcosdumay wrote:
         | Almost everyone is quite ok with whatever, will treat whatever
         | they get as UTC, will be happy to ignore the difference.
         | 
         | Everybody that is not in the above paragraph uses TAI.
        
       | selimnairb wrote:
       | "CAPEX budget always increases, OPEX budget always decreases." is
       | a great synopsis of how capitalism works.
        
       | 0xbadcafebee wrote:
       | _If a post-mortem follow-up task is not picked up within a week,
       | it's unlikely to be completed at all._
       | 
       | This one is literally a law of physics.
       | 
       |  _People give talks at conferences not to convince others that
       | their work is awesome and totally worth the time and effort they
       | put in, but themselves._
       | 
       | I would add: "If you see a big name company give a talk about
       | some cool thing they made, it's probably already been abandoned
       | by that company."
       | 
       |  _Turning things off permanently is surprisingly difficult._
       | 
       | If you don't have a plan to sunset whatever you're building,
       | you're basically telling your future self to go fuck himself.
       | Unless you plan to quit first, in which case you're telling your
       | successor to go fuck himself.
       | 
       |  _The source you're looking at is not the code running in
       | production._
       | 
       | ((cries))
       | 
       |  _Mandatory code reviews do not automatically improve code
       | quality nor reduce the frequency of incidents._
       | 
       | The primary purpose of mandatory code reviews - without a
       | sensible plan of who, when, why or how - is just for people to
       | nitpick your code.
        
         | marcosdumay wrote:
         | A "plan to sunset" something looks like an incredibly alien
         | concept to me. How do you do it when people keep finding new
         | uses for whatever it is?
        
           | 0xbadcafebee wrote:
           | It's very hard, for the very reason that more of the
           | organization will depend on it over time, making it harder to
           | extricate yourself from it. But there are a number of things
           | that help.
           | 
           | 0. Ownership. Try to manipulate the business to put this
           | project under a part of the org where it's very hard for
           | anyone to have leverage over you, so you can fight back when
           | they try to pressure you to keep it going with no budget or
           | staff. (Can you put it under "finance" or "admin" or "HR"?
           | They won't give a shit about your project and aren't
           | responsible to the tech leadership. Sometimes "IT" is the
           | same.)
           | 
           | 1. Money. Assign a fixed budget that runs out after X time.
           | Put yourself in a position that you have no way to ask for
           | more money, so they can't keep trying to stretch your team
           | out with no additional funding. Calculate how much it'll cost
           | the business to try to support it past EOL and put that
           | figure where everyone can see it.
           | 
           | 2. Limits. Design in very specific quotas and limits that
           | give very reliable but limited functionality. If somebody
           | wants this to scale 100x, show them how it literally can't.
           | Prevent stakeholders from trying to do more than is possible.
           | If they want fewer limits, tell them to give you money out of
           | their budget to build and staff a single-tenant version of
           | it. They will quickly go away, probably to (poorly) build
           | their own version of it.
           | 
           | 3. Disclosure. Tell all stakeholders what this thing can and
           | can't do, that you won't be able to scale, what your SLA is,
           | when this thing will be EOL and that they need to put on
           | their calendars now to work to move off of it in time. Do not
           | tell them the actual EOL date, tell them a date 6 months
           | before the actual cut-off date. Communicate often and via
           | various means in public places, because most people will
           | never read anything they aren't interested in.
           | 
           | 4. Stakeholder management. Tightly control who is using your
           | system and what they're using it for. Document the downstream
           | business risk. Make a big stink if somebody starts using your
           | dinky little project with no funding for something mission
           | critical. Remind them of how your limits and budget and SLA
           | and design are all tied together and can't be worked around
           | without redesigning the whole thing.
           | 
           | 5. Transition planning. When your system goes away, something
           | needs to take its place. At design phase, incorporate a
           | timeline that includes a large chunk of time just for
           | supporting getting people off the platform. Also plan for how
           | you could offload the entire system onto some other system.
           | Create a document that lists what a new system will need to
           | have, so whoever is tasked with that will not build something
           | that is impossible to transition to. At sunset time, redirect
           | work towards the transition. Have a solid change management
           | plan and get stakeholder sign-off.
           | 
           | 6. Rigorously track the value created by this thing, or the
           | value lost by trying to maintain it past its sunset date, and
           | all business risks. Collect hard data. You will need it later
           | to argue to senior leadership why keeping this thing online
           | is a terrible idea.
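
One way to read point 2 (Limits) in code: a hard per-tenant quota that fails loudly instead of degrading, so the limit is a designed-in property rather than a tuning knob. This is only a sketch; `FixedQuota`, `QuotaExceeded`, and the tenant names are hypothetical.

```python
from typing import Dict


class QuotaExceeded(Exception):
    """Raised when a tenant asks for more than the designed-in limit."""


class FixedQuota:
    """Hard per-tenant limit with no bursting and no overrides."""

    def __init__(self, limit_per_tenant: int) -> None:
        if limit_per_tenant < 0:
            raise ValueError("limit must be non-negative")
        self._limit = limit_per_tenant
        self._used: Dict[str, int] = {}  # tenant -> units consumed so far

    def consume(self, tenant: str, units: int = 1) -> int:
        """Record usage and return the units the tenant has left."""
        if units <= 0:
            raise ValueError("units must be positive")
        used = self._used.get(tenant, 0)
        if used + units > self._limit:
            # Refuse outright; nothing is recorded for a rejected request.
            raise QuotaExceeded(
                f"{tenant}: {used + units} requested in total, limit is {self._limit}"
            )
        self._used[tenant] = used + units
        return self._limit - self._used[tenant]
```

The explicit exception is the point: a stakeholder who hits it sees the documented limit immediately, which supports the disclosure and stakeholder-management steps in the same list.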
        
             | cpeterso wrote:
             | For more on sunsetting or replacing legacy systems, check
             | out Marianne Bellotti's "Kill It With Fire: Manage Aging
             | Computer Systems (and Future Proof Modern Ones)". Here's a
             | review of the book:
             | 
             | https://www.usenix.org/publications/loginonline/kill-it-fire
        
           | WJW wrote:
           | With careful change management and access control. It's easy
           | to turn off a system if you know for sure that nobody is
           | using it anymore. Large companies and militaries do it all
           | the time just fine so there is no reason a dynamic young
           | startup shouldn't be able to do it. :)
           | 
           | (As a non-sarcastic response: back when I was an officer in
           | the navy, how we would get rid of systems was most definitely
           | taken into account from the very start. Even before we
           | started building ships or radars or whatever, budgets and
           | dock space would be reserved ~30 years into the future to do
           | the decommissioning work. I do realize that this works much
           | better for established organizations that can be reasonably
           | sure that they will exist in 30 years; after all, most
           | startups are hard pressed to last even two. That said,
           | completely disregarding any planning on how to grow out of
           | your current systems does seem to have been the case at all
           | the startups I have consulted for and was a major part of
           | technical debt. IMO a technical leader should know where the
           | skeletons are hidden in the setup for their organization and
           | know roughly in how many months/years they will no longer
           | suffice. Then they can plan the replacement and/or upgrading
           | of said skeletons accordingly)
        
           | KineticLensman wrote:
           | > How do you do it when people keep finding new uses for
           | whatever it is
           | 
           | Assuming there is an upgrade path it does make sense to plan
           | to turn a thing off. The thing may consume resources /
           | facilities that could be redeployed, operators could become
           | available for other tasks (or require retraining, etc), or
           | the controlling organisation might need to be restructured.
           | There might be regulatory implications / costs, such as
           | recycling or disposal of controlled substances etc.
           | 
           | If a retired thing is 'pure software', disposal might be
           | simplified, but if it has physical or facilities elements (as
           | per military capabilities mentioned by a sibling poster),
           | disposal can be decidedly non-trivial.
        
       | dsr_ wrote:
       | I can point to examples for at least 80% of these.
       | 
       | Mostly not in public, though.
        
       ___________________________________________________________________
       (page generated 2022-08-22 23:01 UTC)