[HN Gopher] A few ops lessons we all learn the hard way (2020)
___________________________________________________________________
A few ops lessons we all learn the hard way (2020)
Author : Tomte
Score : 129 points
Date : 2022-08-22 12:55 UTC (10 hours ago)
(HTM) web link (www.netmeister.org)
(TXT) w3m dump (www.netmeister.org)
| encoderer wrote:
| Absence of a signal is itself a signal.
|
| I've built a business around this one!
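A minimal sketch of the "absence of a signal" idea, assuming each job is expected to check in at least once per interval. The class name `HeartbeatMonitor`, its methods, and the job names in the usage note are hypothetical illustrations, not anything described in the thread.

```python
import time


class HeartbeatMonitor:
    """Flags jobs whose expected check-in has not arrived in time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._jobs = {}  # job name -> (last_seen, max_silence_seconds)

    def register(self, job, max_silence):
        # Registration counts as the first check-in.
        self._jobs[job] = (self._clock(), max_silence)

    def ping(self, job):
        # Record a check-in; the allowed silence stays unchanged.
        _, max_silence = self._jobs[job]
        self._jobs[job] = (self._clock(), max_silence)

    def silent_jobs(self):
        # A job is "silent" when its last check-in is older than allowed.
        now = self._clock()
        return sorted(
            job
            for job, (last_seen, max_silence) in self._jobs.items()
            if now - last_seen > max_silence
        )
```

In use, a nightly backup might call `ping("backup")` when it finishes, and an alert would fire whenever `silent_jobs()` is non-empty. The alert is driven by what did not happen rather than by an error message.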
| bravetraveler wrote:
| Quite interesting but I don't really understand the framing with
| self-signed certs
|
| It can and (often does) go poorly, but the only thing you really
| gain from an external CA is a [minor] reduction in responsibility
| number6 wrote:
| OK, let's spin up the CA and create a cert for this service.
| Validity 15 years.
|
| <15 years later>
|
| Why did this thingy break? How do I get a new cert? CA was on a
| Virtual Machine? The old server? What is ESXi? Ok we will just
| spin up this docker container, create a new cert and make it
| last 20 years...
| bombcar wrote:
| This is the main reason Let's Encrypt gives you three-month
| certs - so you set up a way of automating it.
|
| It's not too hard but I wish it could be even easier. The
| best I've found is a main system that gets the wildcard cert
| and then transfers that around where needed.
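Whatever does the renewing, the part most likely to be forgotten is the check that says "renew now". A rough sketch of that check, assuming the expiry date is available in the `notAfter` string format that Python's `ssl` module reports; the function names and the 30-day threshold are assumptions chosen for illustration.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now_epoch):
    """Whole days between now_epoch and a certificate's notAfter time.

    not_after looks like "Jan  5 09:34:43 2018 GMT", the format used in
    the dict returned by ssl.SSLSocket.getpeercert().
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return int((expires_epoch - now_epoch) // SECONDS_PER_DAY)


def needs_renewal(not_after, now_epoch, threshold_days=30):
    # threshold_days is an assumed policy value, not a standard.
    return days_until_expiry(not_after, now_epoch) <= threshold_days
```

Running a check like this from a scheduler, with `time.time()` as `now_epoch`, makes expiry a routine event instead of a surprise 15 years later.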
| bravetraveler wrote:
| I'm not doubting it can be done poorly
|
| I'm just curious why an org would do that when they have a
| domain controller/login infrastructure providing a CA, or one
| of the many secret-storing-engines backing the vast majority
| of their sensitive data (eg: Vault)?
|
| Policies can be defined and enforced to serve as guard rails
| -- we choose the nightmares we accept
| zaphar wrote:
| Ahhh, yes I see where you got lost. You are assuming that
| they either:
|
| * Have a team willing to commit to managing CAs as part of
| the DC/Login infrastructure.
|
| or
|
| * Are actually using something like Vault in their
| infrastructure.
|
| Common mistake, really.
|
| In all seriousness you would be amazed how many places
| neither of those things are true for, and additionally how
| much sheer effort it would take to make them true at those
| orgs.
| 0xbadcafebee wrote:
| If you have a team that manages a domain, it will take a
| month to get anything done with them, if they even allow
| you to get what you need done. If there is a way to make
| it easier, they aren't interested.
| bravetraveler wrote:
| Instead we get to deal with finance for two months to
| order from a public CA!
|
| Before anyone says it, yes we know about Let's Encrypt, and
| no we cannot use it
| bravetraveler wrote:
| Indeed, I am a bit spoiled... I neglected smaller shops
| where 'not enough people' rings even more true.
|
| Where I'm at, we have tons and tons of people, enough to
| dedicate (and form) teams for things. It admittedly skews
| my view quite a bit. Here... we don't have enough
| _capable_ people.
|
| I'm also a bit sour at the alternatives. We have certain
| (publicly visible) certificates that must go through a
| CA. That's an absolutely painful process that requires
| about six levels of finance approval -- every year!
| pid-1 wrote:
| > The source you're looking at is not the code running in
| production.
|
| My first boss's favorite words were: check check check check check.
| That's also the first thing I teach to new engineers: most of
| your assumptions are completely wrong, double check everything.
| d23 wrote:
| Most debugging is simply a series of validations of which of
| your assumptions is incorrect.
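One assumption that can be checked mechanically is the quoted one: that production runs the artifact you think it does. A small sketch, assuming a digest was recorded at build time and the deployed bytes can be read back; both function names are hypothetical.

```python
import hashlib


def sha256_of(data):
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def matches_build(deployed_bytes, expected_digest):
    """True when the deployed artifact hashes to the recorded digest."""
    return sha256_of(deployed_bytes) == expected_digest
```

A mismatch does not say what is running, only that the assumption "it's the build I'm reading" is false, which is usually the faster thing to learn.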
| teddyh wrote:
| As per #53, #37 could probably use a link to
| https://xkcd.com/1597/
| belfalas wrote:
| _> Your network team has a way into the network that your
| security team doesn't know about._
|
| This one is true, and even if you right now go and try to make it
| not true for your company...it will become true again later.
| WFHRenaissance wrote:
| This, along with the obvious intersection of skillsets and
| interests, is why I often now see either network ops and
| security bundled together on the same team/org, or DevOps
| people doing the work of both teams.
| dsr_ wrote:
| DevNetSecOps is much more sensible than most of the other
| ways. The audit team needs to be independent, of course.
| olddustytrail wrote:
| Many that are absolutely true here. But also one that's flat out
| wrong:
|
| 37. Nobody knows how git works; everybody simply rm -fr && git
| checkout's periodically.
|
| I know how it works. Because the first time I ever heard of git
| was when I had started a new job and was told the dev team were
| switching to it. So I spent a couple of days reading up on it and
| learned exactly how it works.
|
| If you work in Ops, I suggest you do your job properly and do
| likewise.
| kerblang wrote:
| I learned how git works but without my cheat sheet and scripts
| I am a helpless angry baby. (guess where I keep them? git)
|
| Still I happily award the Cryptic Weirdo Savant of the Year
| Trophy to anyone who can convincingly lie about memorizing that
| gibberish.
| olddustytrail wrote:
| You clearly didn't learn how it works since you're helpless
| without a cheat sheet. Sounds like you memorized some stuff.
|
| I didn't memorize it, I learned how it worked. These are very
| different things.
| kerblang wrote:
| That was mildly convincing but it's been a fairly
| competitive year
| olddustytrail wrote:
| It's not because I'm super smart, you're just super lazy.
| AnimalMuppet wrote:
| Personal attacks are against the site rules here.
| lolc wrote:
| Cute.
|
| I typically use an array of Git commands per hour of
| working. I consult the manual or a how-to maybe once
| every other week. Re-cloning a repo happens less than
| once a year for me. Can't remember the last time I had to
| do that.
|
| And it's not just me. I see the same for the people I
| work with.
|
| Of course, for many people, learning Git is not worth the
| effort. But that doesn't mean people who handle it
| fluently don't exist.
| icedchai wrote:
| A lot of junior people have no idea how to use git. No doubt,
| it _is_ confusing, especially if you've never used source
| control before. I've seen some seriously screwed up git
| "flows". I've seen people who have no idea what a merge
| conflict is, or how to resolve one, so they wind up committing
| the conflicts.
|
| It wasn't much different in the subversion days, or before
| (CVS, anyone?)
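Committed conflicts are easy to catch before they land. A minimal sketch of a check that could run in a pre-commit hook, assuming the default marker style git writes into conflicted files; the function name is hypothetical.

```python
import re

# Lines git writes into a file when it cannot merge automatically:
# "<<<<<<< ours", "=======", ">>>>>>> theirs".
_MARKER = re.compile(r"^(<{7} |={7}$|>{7} )")


def conflict_marker_lines(text):
    """Return 1-based line numbers that look like unresolved merge markers."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if _MARKER.match(line)
    ]
```

A bare `=======` line can also be a legitimate heading underline in some text formats, so treat hits as candidates rather than proof. `git diff --check` offers a built-in warning for leftover conflict markers as well.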
| olddustytrail wrote:
| I know, and what I'm saying is that it's not confusing if you
| learn how it works. If you just try to figure it out as you
| go along you'll end up with a mental model which is vastly
| more complicated than git itself is!
| crazylifetwist wrote:
| Cache:
| https://web.archive.org/web/20220817133026/https://www.netme...
| gonzo41 wrote:
| No, 11. Took the last place I worked about 5 years to rollout a
| pretty good solution to a massive legacy set of servers. Though I
| gotta say, the super privileged automated certificate renewer
| thingy does seem like a real honey pot.
| mcqueenjordan wrote:
| Really good article. Some of these are subtle, and really must be
| learned the hard way. The only one I found myself thinking I
| disagreed with was "85. Multithreading is rarely worth the added
| complexity." Maybe I simply have yet to learn it the hard way,
| but of all the ways to add complexity, I have tended to find
| multithreading as one of the more legitimate. That being said, it
| has to be done in a simple, easy to reason about way. Usually for
| me, this means fork-joining homogeneous tasks.
| eschneider wrote:
| A. However well you understand multithreading, you only need
| one coworker who doesn't understand multithreading to make your
| life an unending hell. B. You always have at least one coworker
| who doesn't completely understand multithreading. :/
| cpurdy wrote:
| multithreading to A. understand coworker who multithreading,
| you only need one multithreading. :/ doesn't coworker who
| However understand well you life an unending hell. B. You
| always have at least one doesn't make your completely
| understand
| twh270 wrote:
| Yes. If you're going to do multi-threading, let the
| framework/language handle the hard parts[1].
|
| [1] It's _all_ hard parts.
| naasking wrote:
| > Usually for me, this means fork-joining homogeneous tasks
|
| I think the article doesn't make the right distinction:
| parallelism is often worth it, concurrency is what causes the
| headaches. Fork-join is parallelism and generally safe and
| relatively easy.
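Fork-join over homogeneous tasks, as described above, can be sketched in a few lines. This assumes the tasks share no mutable state; `fork_join` and `word_count` are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fork_join(task, inputs, max_workers=4):
    """Run task over inputs in parallel; return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() forks one call per input; list() collects every result,
        # and leaving the with-block waits for all workers to finish.
        return list(pool.map(task, inputs))


def word_count(text):
    # Example task: reads only its own argument, writes nothing shared.
    return len(text.split())
```

For example, `fork_join(word_count, ["a b", "c"])` returns one count per input. In CPython, threads mostly help I/O-bound tasks; for CPU-bound work, `ProcessPoolExecutor` is the usual substitute with the same `map` interface.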
| andreareina wrote:
| TAI > UTC. Except for the fact that no-one uses TAI.
| marcosdumay wrote:
| Almost everyone is quite ok with whatever, will treat whatever
| they get as UTC, will be happy to ignore the difference.
|
| Everybody who is not in the above paragraph uses TAI.
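For scale, the difference being ignored is currently 37 seconds. A sketch of the conversion, assuming a timezone-aware UTC input dated 2017-01-01 or later and no leap second added since; earlier dates need a leap-second table that this example deliberately omits. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# TAI - UTC has been 37 s since 2017-01-01 (assumed still current).
TAI_MINUS_UTC = timedelta(seconds=37)
VALID_FROM = datetime(2017, 1, 1, tzinfo=timezone.utc)


def utc_to_tai(utc_dt):
    """Return the TAI clock reading for an aware UTC datetime."""
    if utc_dt < VALID_FROM:
        raise ValueError("the 37 s offset only holds from 2017-01-01 onward")
    # TAI has no time zone, so the result is returned as a naive datetime.
    return (utc_dt + TAI_MINUS_UTC).replace(tzinfo=None)
```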
| selimnairb wrote:
| "CAPEX budget always increases, OPEX budget always decreases." is
| a great synopsis of how capitalism works.
| 0xbadcafebee wrote:
| _If a post-mortem follow-up task is not picked up within a week,
| it's unlikely to be completed at all._
|
| This one is literally a law of physics.
|
| _People give talks at conferences not to convince others that
| their work is awesome and totally worth the time and effort they
| put in, but themselves._
|
| I would add: "If you see a big name company give a talk about
| some cool thing they made, it's probably already been abandoned
| by that company."
|
| _Turning things off permanently is surprisingly difficult._
|
| If you don't have a plan to sunset whatever you're building,
| you're basically telling your future self to go fuck himself.
| Unless you plan to quit first, in which case you're telling your
| successor to go fuck himself.
|
| _The source you're looking at is not the code running in
| production._
|
| ((cries))
|
| _Mandatory code reviews do not automatically improve code
| quality nor reduce the frequency of incidents._
|
| The primary purpose of mandatory code reviews - without a
| sensible plan of who, when, why or how - is just for people to
| nitpick your code.
| marcosdumay wrote:
| A "plan to sunset" something looks like an incredibly alien
| concept to me. How do you do it when people keep finding new
| uses for whatever it is?
| 0xbadcafebee wrote:
| It's very hard, for the very reason that more of an
| organization will depend on it over time, making it harder to
| extricate from it. But there's a number of things that help.
|
| 0. Ownership. Try to manipulate the business to put this
| project under a part of the org where it's very hard for
| anyone to have leverage over you, so you can fight back when
| they try to pressure you to keep it going with no budget or
| staff. (Can you put it under "finance" or "admin" or "HR"?
| They won't give a shit about your project and aren't
| responsible to the tech leadership. Sometimes "IT" is the
| same.)
|
| 1. Money. Assign a fixed budget that runs out after X time.
| Put yourself in a position that you have no way to ask for
| more money, so they can't keep trying to stretch your team
| out with no additional funding. Calculate how much it'll cost
| the business to try to support it past EOL and put that
| figure where everyone can see it.
|
| 2. Limits. Design in very specific quotas and limits that
| give very reliable but limited functionality. If somebody
| wants this to scale 100x, show them how it literally can't.
| Prevent stakeholders from trying to do more than is possible.
| If they want fewer limits, tell them to give you money out of
| their budget to build and staff a single-tenant version of
| it. They will quickly go away, probably to (poorly) build
| their own version of it.
|
| 3. Disclosure. Tell all stakeholders what this thing can and
| can't do, that you won't be able to scale, what your SLA is,
| when this thing will be EOL and that they need to put on
| their calendars now to work to move off of it in time. Do not
| tell them the actual EOL date, tell them a date 6 months
| before the actual cut-off date. Communicate often and via
| various means in public places, because most people will
| never read anything they aren't interested in.
|
| 4. Stakeholder management. Tightly control who is using your
| system and what they're using it for. Document the downstream
| business risk. Make a big stink if somebody starts using your
| dinky little project with no funding for something mission
| critical. Remind them of how your limits and budget and SLA
| and design are all tied together and can't be worked around
| without redesigning the whole thing.
|
| 5. Transition planning. When your system goes away, something
| needs to take its place. At design phase, incorporate a
| timeline that includes a large chunk of time just for
| supporting getting people off the platform. Also plan for how
| you could offload the entire system onto some other system.
| Create a document that lists what a new system will need to
| have, so whoever is tasked with that will not build something
| that is impossible to transition to. At sunset time, redirect
| work towards the transition. Have a solid change management
| plan and get stakeholder sign-off.
|
| 6. Rigorously track the value created by this thing, or the
| value lost by trying to maintain it past its sunset date, and
| all business risks. Collect hard data. You will need it later
| to argue to senior leadership why keeping this thing online
| is a terrible idea.
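The "Limits" item in the list above is the one that translates most directly into code. A minimal sketch, assuming a fixed per-tenant allowance that is set once and has no method to raise it; the class and method names are hypothetical.

```python
class HardQuota:
    """Fixed per-tenant allowance with no way to raise it at runtime."""

    def __init__(self, limit_per_tenant):
        self._limit = limit_per_tenant
        self._used = {}  # tenant -> units consumed so far

    def try_consume(self, tenant, amount=1):
        used = self._used.get(tenant, 0)
        if used + amount > self._limit:
            # Refuse outright instead of degrading service for everyone.
            return False
        self._used[tenant] = used + amount
        return True

    def remaining(self, tenant):
        return self._limit - self._used.get(tenant, 0)
```

The design choice is the missing feature: with no setter for the limit, "can you just bump it for us?" becomes a funding conversation rather than a config change.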
| cpeterso wrote:
| For more on sunsetting or replacing legacy systems, check
| out Marianne Bellotti's "Kill It With Fire: Manage Aging
| Computer Systems (and Future Proof Modern Ones)". Here's a
| review of the book:
|
| https://www.usenix.org/publications/loginonline/kill-it-fire
| WJW wrote:
| With careful change management and access control. It's easy
| to turn off a system if you know for sure that nobody is
| using it anymore. Large companies and militaries do it all
| the time just fine so there is no reason a dynamic young
| startup shouldn't be able to do it. :)
|
| (As a non-sarcastic response: back when I was an officer in
| the navy, how we would get rid of systems was most definitely
| taken into account from the very start. Even before we
| started building ships or radars or whatever, budgets and
| dock space would be reserved ~30 years into the future to do
| the decommissioning work. I do realize that this works much
| better for established organizations that can be reasonably
| sure that they will exist in 30 years, after all most
| startups are hard pressed to last even two. That said,
| completely disregarding any planning on how to grow out of
| your current systems does seem to have been the case at all
| the startups I have consulted for and was a major part of
| technical debt. IMO a technical leader should know where the
| skeletons are hidden in the setup for their organization and
| know roughly in how many months/years they will no longer
| suffice. Then they can plan the replacement and/or upgrading
| of said skeletons accordingly)
| KineticLensman wrote:
| > How do you do it when people keep finding new uses for
| whatever it is
|
| Assuming there is an upgrade path it does make sense to plan
| to turn a thing off. The thing may consume resources /
| facilities that could be redeployed, operators could become
| available for other tasks (or require retraining, etc), or
| the controlling organisation might need to be restructured.
| There might be regulatory implications / costs, such as
| recycling or disposal of controlled substances etc.
|
| If a retired thing is 'pure software', disposal might be
| simplified, but if it has physical or facilities elements (as
| per military capabilities mentioned by a sibling poster),
| disposal can be decidedly non-trivial.
| dsr_ wrote:
| I can point to examples for at least 80% of these.
|
| Mostly not in public, though.
___________________________________________________________________
(page generated 2022-08-22 23:01 UTC)