[HN Gopher] Diary of a first-time on-call engineer
___________________________________________________________________
Diary of a first-time on-call engineer
Author : kiyanwang
Score : 52 points
Date : 2022-03-14 07:28 UTC (15 hours ago)
(HTM) web link (thenewstack.io)
(TXT) w3m dump (thenewstack.io)
| r1b wrote:
| Thanks for sharing this. We are going through a similar
| organizational shift at $JOB where a large team has been split
| into several sub-teams that own services. It was valuable to see
| how you adapted the on-call rotation to this new structure.
|
| On using a public channel for coordinating incident response, we
| have struggled in the past with stakeholders joining the meeting
| and offering well intended but ultimately distracting input
| during the incident. We've found it best to have one responder
| play the role of "incident commander" and manage external
| communication / rope in more stakeholders as needed. This helps
| avoid conditions that make the incident even more stressful, like
| say a member of upper management demanding frequent updates or
| spreading FUD.
| barbazoo wrote:
| > I pulled out my laptop, tethered my phone and popped online.
| Sitting at a playground picnic table, I re-ran the test that had
| failed and alerted me. It passed.
|
| Boy would I be annoyed if I got paged because someone else's
| commit made a test fail. This should be dealt with by the dev
| that pushed that change. The fact that it made a test fail should
| also mean it's not in prod yet, making me wonder why someone got
| paged in the first place.
| pkaeding wrote:
| It may have been a monitoring/smoke test type of thing, that
| performs some customer-like interaction with the deployed
| service, and makes sure the expected result comes out the other
| side, rather than a build-time test.
| srer wrote:
| Spent about a decade on-call...It is an interesting and perhaps
| even worthwhile experience to do it for a little bit.
|
| IMHO there are two angles to achieve higher reliability:
|
| 1. Build systems which go down less on their own (which is quite
| difficult)
|
| 2. Get good at fixing them fast when they go down (which is much
| easier)
|
| A trivial example of 1 might be moving from no RAID to RAID1, an
| example of 2 might be getting on-call proficient at quickly
| responding to a dead disk and restoring from backups.
|
| (But I did say 1. was hard, even in this simple example maybe you
| used hardware raid, and are finding out just how crappy HW raid
| can be ;)
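The two angles above map onto the standard steady-state availability formula, MTBF / (MTBF + MTTR): angle 1 raises the mean time between failures, angle 2 lowers the mean time to repair. A minimal sketch follows; all the hour figures are made-up illustrative numbers, not data from the thread.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure per 1000 hours, 4 hours to repair.
baseline = availability(mtbf_hours=1000, mttr_hours=4)

# Angle 1: the system fails half as often (MTBF doubled).
fewer_failures = availability(mtbf_hours=2000, mttr_hours=4)

# Angle 2: on-call fixes it twice as fast (MTTR halved).
faster_repair = availability(mtbf_hours=1000, mttr_hours=2)

print(f"baseline:       {baseline:.6f}")
print(f"fewer failures: {fewer_failures:.6f}")
print(f"faster repair:  {faster_repair:.6f}")
```

On paper, doubling MTBF and halving MTTR buy the same availability. The formula does not show who pays for it: angle 2 is paid in on-call hours.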
|
| Companies attempt both angle 1 & 2 to differing degrees. I worked
| a lot in 2, and it sucked. We were a proficient fire department
| in a city of straw houses and gas stoves.
|
| Aside from the general suckage of on-call (24 hour days, many,
| many nights of lost sleep), the work is by its nature high risk,
| high urgency, high impact - but is rarely rewarded sufficiently
| in pay or promotion opportunities.
|
| Having spent so much time on angle 2, I've decided it is largely
| shallow work, and to grow intellectually I needed harder, more
| interesting problems, which meant trying angle 1.
|
| Consequently now my title is SWE and I don't do out of hours on-
| call, and it is glorious. I do notice I see the world differently
| to my less scarred fellow SWEs. It definitely changed how I write
| software, and view the product priorities.
|
| At heart I still consider myself an SRE. It's just I had to place
| myself in the most impactful place to achieve it, as a regular
| product/feature dev.
| tharkun__ wrote:
| I get it, trying to make on-call not sound so bad. Fair enough.
|
| But then she goes into the "example week" including a weekend
| page. It seems like there's not much going on, not many pages,
| not much out of regular hours. And then this prime example of why
| you do not want to have engineers on-call:
|
| > At 5 a.m., I got paged. I jumped out of bed and ran over to my
| computer. The page self-resolved at 5:01 a.m.
|
| She puts a smiley but this can really drive one mad. Second time
| this happens (and let's face it, it happens, even if you tweak
| and adjust those settings over and over) I'm out of there.
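One common setting for this kind of one-minute self-resolving alert is a hold period: only page once the check has been failing continuously for some minimum time. The sketch below is an assumption about how such a rule could look, not the article author's actual configuration; `DebouncedAlert` and `hold_seconds` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebouncedAlert:
    """Pages only after a check has failed continuously for hold_seconds."""

    hold_seconds: float
    failing_since: Optional[float] = None
    paged: bool = False

    def observe(self, now: float, failing: bool) -> bool:
        """Record one check result; return True exactly when a page fires."""
        if not failing:
            # Recovery resets the timer, so a brief blip never pages.
            self.failing_since = None
            self.paged = False
            return False
        if self.failing_since is None:
            self.failing_since = now
        if not self.paged and now - self.failing_since >= self.hold_seconds:
            self.paged = True
            return True
        return False


alert = DebouncedAlert(hold_seconds=300)
blip = [alert.observe(0, True), alert.observe(60, False)]
outage = [alert.observe(120, True), alert.observe(420, True), alert.observe(480, True)]
print(blip, outage)
```

The trade-off is a slower page for real outages, which is part of why the tuning never quite ends.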
|
| I get it, all the talk about "you will build better services
| because otherwise you will be paged". In the end I think it's
| just about saving money for dedicated 24/7 network operations if
| that is what your business requires. If it doesn't have customers
| 24/7, then don't set up any pagers and support staff, if it does,
| pay for it!
|
| I've been on call in a SaaS environment before. In my entire time
| (counted in years) doing that I have had about the same amount of
| pages in total (!) as are contained in her example week. I
| carried my laptop and the phone to tether dutifully but I have
| never been paged by the 24/7 support staff for something like the
| 5:01 self-resolving alert.
| Psychlist wrote:
| In my current on-call setup, that page would count as one hour
| worked of my weekly 40.
|
| Ideally I'd be paid extra as well (I get about an hour's pay a
| week as an on-call allowance). What makes it work is that I'm
| the senior dev on the project that I'm on call for - if it hits
| the fan it really is my problem. And more accurately, in the
| last two years I've been called out twice. So that extra hour a
| week of pay... it's just free money.
| [deleted]
| bauerd wrote:
| Hire in complementary time zones so no one has to be on call at
| night
| [deleted]
| XorNot wrote:
| Reminds me of my first job: on call didn't change much at night
| because the system was so twitchy (as in, SMS every 15 minutes
| that then autoresolved) that I just put my phone on silent and
| ignored it over night.
|
| Of course we also had a night shift guy who would actually call
| you if he got stuck, so hell if I knew why the SMS system
| existed. (That role lasted about a year at most - night shift
| for tech is stupid).
| zwieback wrote:
| We had a team of 3 or 4 developers in a small company with a few
| shrink-wrapped SW products. We took turns being on call for a
| week at a time. At first it was scary but I learned so much,
| talking directly to end users and discussing their issues
| definitely made me a better developer. At first we had a "tech
| support" engineer but it turned out that 90% of the calls had to
| be escalated anyway so we just switched to developers handling
| the calls.
|
| At the next job I was also on call for systems we put on a
| manufacturing line. That was a lot more exciting and involved
| driving to the plant and trying to get things going when
| production stopped. Losing $10000 per hour when the line is stopped wakes you
| up real fast even if you just jumped out of bed 20 minutes ago.
| Psychlist wrote:
| I think what makes those work is being part of a small team.
| It's not so much that if Bob releases bad code you can go over
| and punch Bob, as that no-one wants to be responsible for
| waking their teammates up in the night to start with.
|
| Another case of: when done properly it is a good work
| environment, and it can be literal death when done badly.
| [deleted]
| site-packages1 wrote:
| I don't do this anymore. Very draining to be an SRE. I remember
| one Christmas I worked until 2am because of SRE pages on software
| that, in retrospect, didn't need to even be up at the time.
| Google used to give us SRE bomber jackets, and while those were
| super cool in a nerdy way (I never got one because I wasn't an
| SRE there), it was clear the org had to really go out there with
| incentives to get people to sign up for SRE. I love the practice
| of reliability and building fault tolerant systems, but it's just
| so draining / energizing at the time, then you realize all your
| efforts (or at least mine and colleagues' at the large companies
| for which we worked) went into keeping stupid, unimpactful things
| from falling over briefly. Unless you're SRE for a real, not
| startup-tongue-in-cheek "mission critical" system it is
| absolutely not worth it. In my very cynical opinion, I'd need to
| be an SRE keeping a NASA rocket working, or keeping the ISS up,
| or working on some SDV system, before I would consider a system
| to be really mission critical.
| sdevonoes wrote:
| Sorry, but no. I keep my 9-5 strict. I'm not into the game of
| getting more money in exchange for my scarce free time. If the
| company needs people to work on Sunday mornings, they can hire
| them (i.e., SREs). I can't understand why it is becoming so
| normal for regular (senior) developers to become slaves of our
| companies by working more than 40h/week; the general excuse is
| "you build it, you run it, you fix it". If we are into writing
| robust software, I'm all in, but that's a totally different
| thing.
|
| Edit: I'm not in my 20s anymore and I have a family.
| bogantech wrote:
| The idea of "you build it you fix it" is to make sure things
| really are fixed because you value your time. And it works.
|
| When there's an ops team responsible for on-call nobody gives a
| shit about fixing the code issues that wake them up.
| Twisol wrote:
| The current development team didn't always build it, either.
| And the pressures and incentives are often biased away from
| improving things, no matter how much individual developers
| might want to.
|
| "You build it, you fix it" has to apply to the powers that be
| (i.e. the product managers and such), not just the individual
| contributors.
| dgritsko wrote:
| > After getting more information, we realized it was not
| affecting customers _and could wait until the team that owned the
| service came online_.
|
| This felt like a major red flag to me - why was the team that
| owned the offending service not the one receiving the page?
| barbazoo wrote:
| In many organizations the on-call is among a wider circle of
| people to make sure devs in small teams aren't on-call for
| their team's service every other week.
| beebmam wrote:
| That's a recipe for disaster in my experience. Specifically:
| on-call burnout
| dgritsko wrote:
| Interesting, I've had the opposite experience - an on-call
| rotation that's too wide makes it difficult to know how to
| respond effectively to an alert, since it's likely to
| involve a service that you may know nothing about.
| Macha wrote:
| Yeah, someone asking you every 15 minutes for an update
| while you're trying to read the beginner's guide to
| service X and connect it to their question isn't a fun
| time.
| bananashakes wrote:
| We considered that model, but decided against mandatory
| participation in off-hours on-call. The off-hours rotation is
| voluntary and paid.
|
| This is a trade-off in order to reduce the number of people who
| have to be on-call at a given time and maintain a voluntary
| model. Generally, though, the person on-call will be from a
| team that owns at least a portion of the services they're on-
| call for, and the services are distributed among the two
| rotations thoughtfully.
| SketchySeaBeast wrote:
| I can't help but notice none of these items actually needed to
| get fixed.
| robertcorey wrote:
| This person is straight-up brainwashed, lol.
| someelephant wrote:
| I'd say work addict is a better term. The dream of all
| employers.
| mirntyfirty wrote:
| I think that there's a tendency online to attempt to gaslight
| certain ideas into reality and it creates a tension because
| they are wrong but it can be unwise to call them out.
| SketchySeaBeast wrote:
| Don't worry, it's not robbing you of the evenings and
| weekends, which "maybe you like", it's a fun new challenging
| opportunity. Have you had a chance to panic yet on Saturday
| morning at 3 AM? Would you like to? It's such a thrill.
| babyshake wrote:
| See also: exciting hackathons where you get to work late
| nights and weekends on "fun ideas" that the leadership
| would like implemented ASAP.
| VectorLock wrote:
| As soon as I read "volunteer on-call team" my wtf-o-meter got
| pegged.
| sdevonoes wrote:
| Probably the author is 22 years old or so; otherwise I can't
| understand it.
| VectorLock wrote:
| Although they said it was a "paid volunteer" (which I don't
| think is what a volunteer is) so maybe it was really good -
| time and a half at least I'd hope.
| pkaeding wrote:
| This is great. Having the engineers who built the thing be the
| same people who respond to pages is a great way to incentivize
| robust systems. If you built it, you likely know how to fix it
| better than anyone else, and if you don't want to be disturbed in
| the evening, you will think about how to better deal with faults
| during the early stages of development.
| [deleted]
| jbreckmckye wrote:
| I'm sorry, I'm going to have to chew you out here.
|
| Firstly: when was the last time your on call rotation comprised
| exclusively programs that you had written? How many systems
| does your team look after that it "inherited" from elsewhere?
|
| Secondly, how many organisations allow developers to prioritise
| solving pages over other forms of revenue generating work? How
| many teams demand negotiation with a "product owner" before a
| piece of work can be added to the board?
|
| Thirdly, how many pages are truly easy or tractable to fix? I
| have worked in teams where we were constantly paged due to
| external API failures, but intra team disagreements meant we
| could neither punt the responsibility elsewhere nor resolve the
| problem. The problem wasn't technical, it was social, but
| there's no PagerDuty for absentee Engineering Managers.
|
| Fourthly, how often is the Real Problem(tm) for the reliability
| and uptime of a given system, actually at the program level,
| and not at architecture or system level? How many programs are
| hamstrung by internals alone? Once you get past the basics,
| most of the _really_ significant decisions affecting
| reliability are at a system design level that Johnny or Jane
| Developer isn't going to be empowered to fix in a Scrum
| "sprint".
|
| Really, this "shitty on call incentivises robust systems"
| argument is facile. It's paper thin. Put it out in the sunlight
| for a second and it crumbles. It only makes any sense,
| tentatively, under idealised conditions where developers alone
| are responsible for non functional requirements and the usual
| relationship of employer / employee is suspended.
|
| Think about it for a second. It's just rot.
| aero142 wrote:
| I've been part of an on call rotation at my current job and
| previous job. Both had the same rules because I and other
| engineers insisted on it.
|
| * You are only on-call for systems you can directly change.
|
| * The person who is on call has full flexibility to work on
| anything that improves the life of the on-call person for
| that time.
|
| * Anything that wakes people up in the middle of the night
| gets prioritized above new feature work.
|
| * There is a pager set of monitors, and a message only set of
| monitors. If a page goes off, and there is nothing the on-
| call engineer can do about it, it gets moved to the message
| only channel or removed, because it is a bad monitor.
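The last rule in the list above can be sketched as a small review step that runs after every page. This is only an illustration under assumed names (`Channel`, `Monitor`, `review_after_page` are hypothetical), not the commenter's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    PAGE = "page"        # wakes a human up
    MESSAGE = "message"  # posts to a channel, read during work hours


@dataclass(frozen=True)
class Monitor:
    name: str
    channel: Channel


def review_after_page(monitor: Monitor, actionable: bool) -> Monitor:
    """Demote a paging monitor when the on-call engineer could do nothing."""
    if monitor.channel is Channel.PAGE and not actionable:
        return Monitor(monitor.name, Channel.MESSAGE)
    return monitor


# Hypothetical monitors: one the engineer could act on, one they could not.
disk_full = review_after_page(Monitor("disk-full", Channel.PAGE), actionable=True)
vendor_api = review_after_page(Monitor("vendor-api-slow", Channel.PAGE), actionable=False)
print(disk_full.channel.value, vendor_api.channel.value)
```

Removing the monitor outright, the other option the rule allows, would simply be dropping it from whatever list holds the monitors.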
|
| I discussed this list of rules when I interviewed and the job
| description included being on-call. It wasn't a negotiation.
| If those aren't followed, I remove the monitors. I'm sorry
| you had a terrible work environment, but I encourage everyone
| to have professional standards. I hope you find a place where
| you can.
| sdevonoes wrote:
| Sorry, but no. I keep my 9-5 strict. I'm not into the game of
| getting more money in exchange for my scarce free time. If the
| company needs people to work on Sunday mornings, they can hire
| them (i.e., SREs). I can't understand why it is becoming so
| normal for regular (senior) developers to become slaves of our
| companies by working more than 40h/week; the general excuse is
| "you build it, you run it, you fix it". If we are into writing
| robust software, I'm all in, but that's a totally different
| thing.
| knicholes wrote:
| That sounds amazing until you get a system that is complicated
| and the part of the system you work on relies on an unstable
| dependency upon which you have no control. Oh no, my service
| isn't responding to 99.9% of requests within 250ms!? Boom, page
| at 2am. The service that is responsible for the alert either
| doesn't have monitoring set up correctly or at all or they
| aggregate their metrics differently, so on average, all of
| their calls are 5ms, but for all of your calls, maybe they're
| taking 2-3 seconds.
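The aggregation problem described above is easy to reproduce: a mean over all callers can look healthy while one caller sees multi-second latency. The numbers and caller names below are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical latency samples in milliseconds, tagged by caller.
calls = [("other-callers", 5.0)] * 9990 + [("my-service", 2500.0)] * 10

# What the dependency's dashboard shows: one mean over every call.
overall_ms = mean(latency for _, latency in calls)

# What each caller actually experiences: the mean per caller.
grouped = defaultdict(list)
for caller, latency in calls:
    grouped[caller].append(latency)
by_caller_ms = {caller: mean(samples) for caller, samples in grouped.items()}

print(f"overall mean: {overall_ms:.3f} ms")
for caller, value in by_caller_ms.items():
    print(f"{caller}: {value:.1f} ms")
```

Breaking the metric down by caller, or alerting on a high percentile instead of the mean, is what makes the slow path visible to the team that owns the dependency.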
|
| It's a nightmare that I escaped after three years. I was driven
| from a happy person to someone who hated his life. It took me a
| while to realize it was my job. The worst part is my management
| told us that the on-call pay was already "baked into your
| salary." I switched jobs internally. No more on-call.
| Strangely, I was able to keep my pay without having to wake up
| in the middle of the night any more.
|
| Oh yeah, and you aren't the only one who wakes up. Someone
| sleeping in your bed may be an insomniac and may have just
| fallen asleep. Your alert wakes them. Now they don't get to
| sleep for the rest of the night. Now you and they are both
| sleep deprived and irritable. It obliterates personal
| relationships. Maybe your kids overhear you explaining to
| someone what is going on in the next room. Whenever anyone asks
| you about your job, PTSD triggers and you spend ten minutes
| venting.
|
| I am so, so, so glad I'm no longer on call. I don't know what
| they'd have to pay me to get back on it, but it's at least 2x,
| and even then, I couldn't last for over a year.
|
| You don't get to go to movies. You don't get to join your
| friends when they go biking or hiking or camping. You get
| interrupted in the middle of a parent-teacher conference. You
| have a laptop next to you at the 4th of July party. You get
| interrupted in the middle of a shower. You don't get to host
| Thanksgiving or Christmas because you may have to work in the
| middle of making dinner. You're about to roll the dice to take
| the first turn of a game you've been promising your kids you'd
| play with them in the first attempt in a long time to be a
| decent parent, and your alarm goes off, you work for the next
| three hours, and they play without you.
|
| There is no 40-hour work week for software engineers. There is
| no union. You work your normal days then you also get to work
| normal nights. Then you work a normal day again either fixing
| the problem that caused the alert or spend 6 hours in post-
| mortems explaining what happened when instead you should be
| sleeping.
| willcipriano wrote:
| In a world where people leave every 1 - 2 years to stay current
| with the market, more often than not the failure will be
| related to what someone else did a long time ago.
| drivers99 wrote:
| Multiple holidays ruined by being on-call (New Year's Eve more
| than once due to year-end bugs, sitting in another room on a
| laptop and phone while everyone has Thanksgiving), countless
| weekends, disturbed sleep patterns. "Daring" to try to enjoy your
| weekend like this person said. I for one am tired of it.
| Psychlist wrote:
| WTF, you're on call when on holiday? That seems awful. I'd say
| "do teams really do that" but obviously they do. That's just
| terrible management IMO, and a real incentive to take up hiking
| or anything out of cellphone coverage.
| 0xbadcafebee wrote:
| Are there any books that describe building an on-call practice?
| If not, I volunteer to write one. It's not complicated, but a lot
| of teams miss some of the fundamentals that improve the overall
| results. On-call's purpose is not just to wake people up in the
| middle of the night, it's to drive continuous reduction of on-
| call incidents. You should get to the point where you don't even
| _need_ an on-call (but have it just in case).
| johncessna wrote:
| +1 the point of on call was to wake up the team who caused the
| problem in the first place, and who presumably have the ability
| to fix it via code changes etc.
|
| This is in contrast to the model where there's a team that
| fields issues and tries their best to fix them, but ultimately
| has to put fix-it tickets to the software owners.
|
| The on call rotations I've seen end up being a hybrid where you
| get the worst of both worlds. The team has enough
| responsibilities that no one person on the team knows how to
| handle everything, and the team is so inundated with pages that
| it just becomes accepted that one person from the team every X
| weeks is going to do nothing but keep the service from burning down.
| sys_64738 wrote:
| Being on call is awful.
| debarshri wrote:
| I was a dev turned SRE at one point in my life. Man, the anxiety
| basically ruined my life. I could see I became a very hyper person,
| checking my phone all the time, to the point that it impacted my
| relationship at the time. If you are inheriting a bad project
| or service without a lot of automated remediations or processes,
| paired with a toxic work environment, just don't. No one prepares
| you for the anxiety the role brings with it. In the end you
| will realise the work and the 2x-3x hourly rate is probably not
| worth it. I have seen people get burned out. You have to have
| tools and certain processes in place for SRE/on-call duties.
| Also, Slack notifications on your phone are actually the worst, as
| your brain always thinks you got a message for some reason when you
| keep your phone in your pocket. Just my opinion.
| rr808 wrote:
| Yeah, if it gets too full-on it fries your nerves; it also
| ruined my concentration. When I went back to pure dev I
| couldn't sit and work because I always was waiting for some
| alert to go off.
| debarshri wrote:
| Meditating helped me regain my focus so far.
___________________________________________________________________
(page generated 2022-03-14 23:00 UTC)