[HN Gopher] Diary of a first-time on-call engineer
       ___________________________________________________________________
        
       Diary of a first-time on-call engineer
        
       Author : kiyanwang
       Score  : 52 points
       Date   : 2022-03-14 07:28 UTC (15 hours ago)
        
 (HTM) web link (thenewstack.io)
 (TXT) w3m dump (thenewstack.io)
        
       | r1b wrote:
       | Thanks for sharing this. We are going through a similar
       | organizational shift at $JOB where a large team has been split
       | into several sub-teams that own services. It was valuable to see
       | how you adapted the on-call rotation to this new structure.
       | 
       | On using a public channel for coordinating incident response, we
       | have struggled in the past with stakeholders joining the meeting
       | and offering well intended but ultimately distracting input
       | during the incident. We've found it best to have one responder
       | play the role of "incident commander" and manage external
       | communication / rope in more stakeholders as needed. This helps
       | avoid conditions that make the incident even more stressful, like
       | say a member of upper management demanding frequent updates or
       | spreading FUD.
        
       | barbazoo wrote:
       | > I pulled out my laptop, tethered my phone and popped online.
       | Sitting at a playground picnic table, I re-ran the test that had
       | failed and alerted me. It passed.
       | 
       | Boy would I be annoyed if I got paged because someone else's
       | commit made a test fail. This should be dealt with by the dev
       | that pushed that change. The fact that it made a test fail should
       | also mean it's not in prod yet, making me wonder why someone got
       | paged in the first place.
        
         | pkaeding wrote:
         | It may have been a monitoring/smoke test type of thing, that
         | performs some customer-like interaction with the deployed
         | service, and makes sure the expected result comes out the other
         | side, rather than a build-time test.
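
A minimal sketch of the kind of post-deploy smoke check pkaeding describes,
as opposed to a build-time test. Everything named here (smoke_check,
fake_fetch, the URL and the expected text) is hypothetical and not taken
from the article; the HTTP call is injected as a function so the check can
run without a network.

```python
from typing import Callable, NamedTuple, Tuple


class CheckResult(NamedTuple):
    ok: bool
    detail: str


def smoke_check(fetch: Callable[[str], Tuple[int, str]],
                url: str,
                expected_text: str) -> CheckResult:
    """Run one customer-like request against a deployed service.

    `fetch` returns (status_code, body). It is passed in so that a real
    HTTP client or a stand-in can be used interchangeably.
    """
    try:
        status, body = fetch(url)
    except Exception as exc:  # any transport error counts as a failed check
        return CheckResult(False, f"request failed: {exc}")
    if status != 200:
        return CheckResult(False, f"unexpected status {status}")
    if expected_text not in body:
        return CheckResult(False, "expected text missing from response")
    return CheckResult(True, "ok")


def fake_fetch(url: str) -> Tuple[int, str]:
    # Stand-in for a real HTTP call; always returns a healthy page.
    return 200, "<h1>Welcome back</h1>"


print(smoke_check(fake_fetch, "https://example.invalid/login", "Welcome back"))
```

A check like this runs against production on a schedule, which is one way
a "failed test" can page someone even though the change is already live.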
        
       | srer wrote:
       | Spent about a decade on-call...It is an interesting and perhaps
       | even worthwhile experience to do it for a little bit.
       | 
       | IMHO there are two angles to achieve higher reliability:
       | 
       | 1. Build systems which go down less on their own (which is quite
       | difficult)
       | 
       | 2. Get good at fixing them fast when they go down (which is much
       | easier)
       | 
       | A trivial example of 1 might be moving from no RAID to RAID1, an
       | example of 2 might be getting on-call proficient at quickly
       | responding to a dead disk and restoring from backups.
       | 
       | (But I did say 1. was hard, even in this simple example maybe you
       | used hardware raid, and are finding out just how crappy HW raid
       | can be ;)
       | 
       | Companies attempt both angle 1 & 2 to differing degrees. I worked
       | a lot in 2, and it sucked. We were a proficient fire department
       | in a city of straw houses and gas stoves.
       | 
       | Aside from the general suckage of on-call (24 hour days, many,
       | many nights of lost sleep), the work is by its nature high risk,
       | high urgency, high impact - but is rarely rewarded sufficiently
       | in pay or promotion opportunities.
       | 
       | Having spent so much time on angle 2, I've decided it is largely
       | shallow work, and to grow intellectually I needed harder more
       | interesting problems, which meant trying angle 1.
       | 
       | Consequently now my title is SWE and I don't do out of hours on-
       | call, and it is glorious. I do notice I see the world differently
       | to my less scarred fellow SWE. It definitely changed how I write
       | software, and view the product priorities.
       | 
       | At heart I still consider myself an SRE. It's just I had to place
       | myself in the most impactful place to achieve it, as a regular
       | product/feature dev.
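
srer's two angles can be read against the standard steady-state
availability ratio, MTBF / (MTBF + MTTR). The sketch below is only an
illustration: the figures (500 hours between failures, 4 hours to repair)
are made-up assumptions, not numbers from the thread, and the function
names are hypothetical.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given mean time between
    failures (MTBF) and mean time to repair (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def incidents_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate number of failure/repair cycles in a year."""
    return HOURS_PER_YEAR / (mtbf_hours + mttr_hours)


baseline = availability(mtbf_hours=500, mttr_hours=4)
fewer_failures = availability(mtbf_hours=1000, mttr_hours=4)  # angle 1
faster_repair = availability(mtbf_hours=500, mttr_hours=2)    # angle 2

print(f"baseline:       {baseline:.4%}")
print(f"fewer failures: {fewer_failures:.4%}, "
      f"{incidents_per_year(1000, 4):.1f} incidents/year")
print(f"faster repair:  {faster_repair:.4%}, "
      f"{incidents_per_year(500, 2):.1f} incidents/year")
```

With these assumed figures, doubling MTBF and halving MTTR reach the same
availability, but only angle 1 reduces how often someone gets paged.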
        
       | tharkun__ wrote:
       | I get it, trying to make on-call not sound so bad. Fair enough.
       | 
       | But then she goes into the "example week" including a weekend
       | page. It seems like there's not much going on, not many pages,
       | not much out of regular hours. And then this prime example of why
       | you do not want to have engineers on-call:
       | 
       | > At 5 a.m., I got paged. I jumped out of bed and ran over to my
       | computer. The page self-resolved at 5:01 a.m.
       | 
       | She puts a smiley but this can really drive one mad. Second time
       | this happens (and let's face it, it happens, even if you tweak
       | and adjust those settings over and over) I'm out of there.
       | 
       | I get it, all the talk about "you will build better services
       | because otherwise you will be paged". In the end I think it's
       | just about saving money for dedicated 24/7 network operations if
       | that is what your business requires. If it doesn't have customers
       | 24/7, then don't set up any pagers and support staff, if it does,
       | pay for it!
       | 
       | I've been on call in a SaaS environment before. In my entire time
       | (counted in years) doing that I have had about the same amount of
       | pages in total (!) as are contained in her example week. I
       | carried my laptop and the phone to tether dutifully but I have
       | never been paged by the 24/7 support staff for something like the
       | 5:01 self-resolving alert.
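
One common way to cut down on pages that self-resolve within a minute, like
the 5 a.m. one tharkun__ quotes, is to page only after a check has failed
several times in a row. This is a sketch under assumptions: the function
name, the sample history and the thresholds are hypothetical, and nothing
here reflects how the article's team configured alerting.

```python
from typing import Iterable, List


def pages_after_debounce(samples: Iterable[bool],
                         required_failures: int) -> List[int]:
    """Return the sample indexes at which a page would fire.

    `samples` holds periodic check results (True = healthy). A page fires
    when a failure streak reaches `required_failures`, at most once per
    streak.
    """
    if required_failures < 1:
        raise ValueError("required_failures must be at least 1")
    pages = []
    streak = 0
    for index, healthy in enumerate(samples):
        if healthy:
            streak = 0
            continue
        streak += 1
        if streak == required_failures:
            pages.append(index)
    return pages


# One-minute checks: a single blip at index 2, then an outage from index 5.
history = [True, True, False, True, True, False, False, False, False, True]
print(pages_after_debounce(history, required_failures=1))
print(pages_after_debounce(history, required_failures=3))
```

The trade-off is that a real outage pages later, by the extra check
intervals the streak has to cover.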
        
         | Psychlist wrote:
         | Under my current on-call setup, that page would count as one
         | hour worked of my weekly 40.
         | 
         | Ideally I'd be paid extra as well (I get about an hour's pay a
         | week as an on call allowance). What makes it work is that I'm
         | the senior dev on the project that I'm on call for - if it hits
         | the fan it really is my problem. And more accurately, in the
         | last two years I've been called out twice. So that extra hour a
         | week of pay... it's just free money.
        
         | bauerd wrote:
         | Hire in complementary time zones so no one has to be on call at
         | night
        
         | XorNot wrote:
         | Reminds me of my first job: on call didn't change much at night
         | because the system was so twitchy (as in, SMS every 15 minutes
         | that then autoresolved) that I just put my phone on silent and
         | ignored it over night.
         | 
         | Of course we also had a night shift guy who would actually call
         | you if he got stuck, so hell if I knew why the SMS system
         | existed. (That role lasted about a year at most - night shift
         | for tech is stupid).
        
       | zwieback wrote:
       | We had a team of 3 or 4 developers in a small company with a few
       | shrink-wrapped SW products. We took turns being on call for a
       | week at a time. At first it was scary but I learned so much,
       | talking directly to end users and discussing their issues
       | definitely made me a better developer. At first we had a "tech
       | support" engineer but it turned out that 90% of the calls had to
       | be escalated anyway so we just switched to developers handling
       | the calls.
       | 
       | At the next job I was also on call for systems we put on a
       | manufacturing line. That was a lot more exciting and involved
       | driving to the plant and trying to get things going when
       | production stops. Losing $10000 per hour when the line is stopped
       | wakes you up real fast even if you just jumped out of bed 20
       | minutes ago.
        
         | Psychlist wrote:
         | I think what makes those work is being part of a small team.
         | It's not so much that if Bob releases bad code you can go over
         | and punch Bob, as that no-one wants to be responsible for
         | waking their teammates up in the night to start with.
         | 
         | Another case of: done properly it can be a good work
         | environment, and done badly it can be literal death.
        
       | site-packages1 wrote:
       | I don't do this anymore. Very draining to be an SRE. I remember
       | one Christmas I worked until 2am because of SRE pages on software
       | that, in retrospect, didn't need to even be up at the time.
       | Google used to give us SRE bomber jackets, and while those were
       | super cool in a nerdy way (I never got one because I wasn't an
       | SRE there), it was clear the org had to really go out there with
       | incentives to get people to sign up for SRE. I love the practice
       | of reliability and building fault tolerant systems, but it's just
       | so draining / energizing at the time, then you realize all your
       | efforts (or at least mine and colleagues' at the large companies
       | for which we worked) went into keeping stupid, unimpactful things
       | from falling over briefly. Unless you're SRE for a real, not
       | startup-tongue-in-cheek "mission critical" system it is
       | absolutely not worth it. In my very cynical opinion, I'd need to
       | be an SRE keeping a NASA rocket working, or keeping the ISS up,
       | or working on some SDV system, before I would consider a system
       | to be really mission critical.
        
       | sdevonoes wrote:
       | Sorry, but no. I keep my 9-5 strict. I'm not into the game of
       | getting more money in exchange for my scarce free time. If the
       | company needs people to work on Sunday mornings, they can hire
       | them (i.e., SREs). I can't understand why it is becoming so
       | normal for regular (senior) developers to become slaves of our
       | companies by working more than 40h/week; the general excuse is
       | "you build it, you run it, you fix it". If we are into writing
       | robust software, I'm all in, but that's a totally different
       | thing.
       | 
       | Edit: I'm not in my 20s anymore and I have a family.
        
         | bogantech wrote:
         | The idea of "you build it you fix it" is to make sure things
         | really are fixed because you value your time. And it works.
         | 
         | When there's an ops team responsible for on-call nobody gives a
         | shit about fixing the code issues that wake them up.
        
           | Twisol wrote:
           | The current development team didn't always build it, either.
           | And the pressures and incentives are often biased away from
           | improving things, no matter how much individual developers
           | might want to.
           | 
           | "You build it, you fix it" has to apply to the powers that be
           | (i.e. the product managers and such), not just the individual
           | contributors.
        
       | dgritsko wrote:
       | > After getting more information, we realized it was not
       | affecting customers _and could wait until the team that owned the
       | service came online_.
       | 
       | This felt like a major red flag to me - why was the team that
       | owned the offending service not the one receiving the page?
        
         | barbazoo wrote:
         | In many organizations the on-call is among a wider circle of
         | people to make sure devs in small teams aren't on-call for
         | their team's service every other week.
        
           | beebmam wrote:
           | That's a recipe for disaster in my experience. Specifically:
           | on-call burnout
        
             | dgritsko wrote:
             | Interesting, I've had the opposite experience - an on-call
             | rotation that's too wide makes it difficult to know how to
             | respond effectively to an alert, since it's likely to
             | involve a service that you may know nothing about.
        
               | Macha wrote:
               | Yeah, someone asking you every 15 minutes for an update
               | while you're trying to read the beginner's guide to
               | service X and connect it to their question isn't a fun
               | time.
        
         | bananashakes wrote:
         | We considered that model, but decided against mandatory
         | participation in off-hours on-call. The off-hours rotation is
         | voluntary and paid.
         | 
         | This is a trade-off in order to reduce the number of people who
         | have to be on-call at a given time and maintain a voluntary
         | model. Generally, though, the person on-call will be from a
         | team that owns at least a portion of the services they're on-
         | call for, and the services are distributed among the two
         | rotations thoughtfully.
        
       | SketchySeaBeast wrote:
       | I can't help but notice none of these items actually needed to
       | get fixed.
        
       | robertcorey wrote:
       | This person is straight-up brainwashed, lol.
        
         | someelephant wrote:
         | I'd say work addict is a better term. The dream of all
         | employers.
        
         | mirntyfirty wrote:
         | I think that there's a tendency online to attempt to gaslight
         | certain ideas into reality and it creates a tension because
         | they are wrong but it can be unwise to call them out.
        
           | SketchySeaBeast wrote:
           | Don't worry, it's not robbing you of the evenings and
           | weekends, which "maybe you like", it's a fun new challenging
           | opportunity. Have you had a chance to panic yet on Saturday
           | morning at 3 AM? Would you like to? It's such a thrill.
        
             | babyshake wrote:
             | See also: exciting hackathons where you get to work late
             | nights and weekends on "fun ideas" that the leadership
             | would like implemented ASAP.
        
         | VectorLock wrote:
         | As soon as I read "volunteer on-call team" my wtf-o-meter got
         | pegged.
        
           | sdevonoes wrote:
           | Probably the author is 22 years old or so; otherwise I can't
           | understand it.
        
             | VectorLock wrote:
             | Although they said it was a "paid volunteer" (which I don't
             | think is what a volunteer is) so maybe it was really good -
             | time and a half at least I'd hope.
        
       | pkaeding wrote:
       | This is great. Having the engineers who built the thing be the
       | same people who respond to pages is a great way to incentivize
       | robust systems. If you built it, you likely know how to fix it
       | better than anyone else, and if you don't want to be disturbed in
       | the evening, you will think about how to better deal with faults
       | during the early stages of development.
        
         | jbreckmckye wrote:
         | I'm sorry, I'm going to have to chew you out here.
         | 
         | Firstly: when was the last time your on call rotation comprised
         | exclusively programs that you had written? How many systems
         | does your team look after that it "inherited" from elsewhere?
         | 
         | Secondly, how many organisations allow developers to prioritise
         | solving pages over other forms of revenue generating work? How
         | many teams demand negotiation with a "product owner" before a
         | piece of work can be added to the board?
         | 
         | Thirdly, how many pages are truly easy or tractable to fix? I
         | have worked in teams where we were constantly paged due to
         | external API failures, but intra team disagreements meant we
         | could neither punt the responsibility elsewhere nor resolve the
         | problem. The problem wasn't technical, it was social, but
         | there's no PagerDuty for absentee Engineering Managers.
         | 
         | Fourthly, how often is the Real Problem(tm) for the reliability
         | and uptime of a given system, actually at the program level,
         | and not at architecture or system level? How many programs are
         | hamstrung by internals alone? Once you get past the basics,
         | most of the _really_ significant decisions affecting
         | reliability are at a system design level that Johnny or Jane
         | Developer isn't going to be empowered to fix in a Scrum
         | "sprint".
         | 
         | Really, this "shitty on call incentivises robust systems"
         | argument is facile. It's paper thin. Put it out in the sunlight
         | for a second and it crumbles. It only makes any sense,
         | tentatively, under idealised conditions where developers alone
         | are responsible for non functional requirements and the usual
         | relationship of employer / employee is suspended.
         | 
         | Think about it for a second. It's just rot.
        
           | aero142 wrote:
           | I've been part of an on call rotation at my current job and
           | previous job. Both had the same rules because I and other
           | engineers insisted on it.
           | 
           | * You are only on-call for systems you can directly change.
           | 
           | * The person who is on call has full flexibility to work on
           | anything that improves the life of the on-call person for
           | that time.
           | 
           | * Anything that wakes people up in the middle of the night
           | gets prioritized above new feature work.
           | 
           | * There is a pager set of monitors, and a message only set of
           | monitors. If a page goes off, and there is nothing the on-
           | call engineer can do about it, it gets moved to the message
           | only channel or removed, because it is a bad monitor.
           | 
           | I discussed this list of rules when I interviewed and the job
           | description included being on-call. It wasn't a negotiation.
           | If those aren't followed, I remove the monitors. I'm sorry
           | you had a terrible work environment, but I encourage everyone
           | to have professional standards. I hope you find a place where
           | you can.
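
aero142's last rule, a pager set and a message-only set of monitors with
non-actionable pages demoted, could be modeled roughly as below. This is a
sketch only: the class and function names and the two example monitors are
hypothetical, and the comment also allows removing a bad monitor outright,
which this sketch does not cover.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    PAGER = "pager"
    MESSAGE_ONLY = "message-only"


@dataclass
class Monitor:
    name: str
    channel: Channel = Channel.PAGER


def review_page(monitor: Monitor, was_actionable: bool) -> Monitor:
    """Post-page review: demote a paging monitor when the on-call engineer
    could do nothing about the alert it raised."""
    if monitor.channel is Channel.PAGER and not was_actionable:
        monitor.channel = Channel.MESSAGE_ONLY
    return monitor


disk = review_page(Monitor("disk-full"), was_actionable=True)
upstream = review_page(Monitor("third-party-api-latency"), was_actionable=False)
print(disk.name, "->", disk.channel.value)
print(upstream.name, "->", upstream.channel.value)
```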
        
         | sdevonoes wrote:
         | Sorry, but no. I keep my 9-5 strict. I'm not into the game of
         | getting more money in exchange for my scarce free time. If the
         | company needs people to work on Sunday mornings, they can hire
         | them (i.e., SREs). I can't understand why it is becoming so
         | normal for regular (senior) developers to become slaves of our
         | companies by working more than 40h/week; the general excuse is
         | "you build it, you run it, you fix it". If we are into writing
         | robust software, I'm all in, but that's a totally different
         | thing.
        
         | knicholes wrote:
         | That sounds amazing until you get a system that is complicated
         | and the part of the system you work on relies on an unstable
         | dependency over which you have no control. Oh no, my service
         | isn't responding to 99.9% of requests within 250ms!? Boom, page
         | at 2am. The service that is responsible for the alert either
         | doesn't have monitoring set up correctly or at all or they
         | aggregate their metrics differently, so on average, all of
         | their calls are 5ms, but for all of your calls, maybe they're
         | taking 2-3 seconds.
         | 
         | It's a nightmare that I escaped after three years. I was driven
         | from a happy person to someone who hated his life. It took me a
         | while to realize it was my job. The worst part is my management
         | told us that the on-call pay was already "baked into your
         | salary." I switched jobs internally. No more on-call.
         | Strangely, I was able to keep my pay without having to wake up
         | in the middle of the night any more.
         | 
         | Oh yeah, and you aren't the only one who wakes up. Someone
         | sleeping in your bed may be an insomniac and may have just
         | fallen asleep. Your alert wakes them. Now they don't get to
         | sleep for the rest of the night. Now you and they are both
         | sleep deprived and irritable. It obliterates personal
         | relationships. Maybe your kids overhear you explaining to
         | someone what is going on in the next room. Whenever anyone asks
         | you about your job, PTSD triggers and you spend ten minutes
         | venting.
         | 
         | I am so, so, so glad I'm no longer on call. I don't know what
         | they'd have to pay me to get back on it, but it's at least 2x,
         | and even then, I couldn't last for over a year.
         | 
         | You don't get to go to movies. You don't get to join your
         | friends when they go biking or hiking or camping. You get
         | interrupted in the middle of a parent-teacher conference. You
         | have a laptop next to you at the 4th of July party. You get
         | interrupted in the middle of a shower. You don't get to host
         | Thanksgiving or Christmas because you may have to work in the
         | middle of making dinner. You're about to roll the dice to take
         | the first turn of a game you've been promising your kids you'd
         | play with them in the first attempt in a long time to be a
         | decent parent, and your alarm goes off, you work for the next
         | three hours, and they play without you.
         | 
         | There is no 40-hour work week for software engineers. There is
         | no union. You work your normal days then you also get to work
         | normal nights. Then you work a normal day again either fixing
         | the problem that caused the alert or spend 6 hours in post-
         | mortems explaining what happened when instead you should be
         | sleeping.
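
knicholes's point about a dependency whose calls average 5 ms overall while
one caller sees 2-3 second responses can be reproduced with a small worked
example. The call counts and latencies below are invented for illustration
(chosen so the overall mean comes out near 5 ms), and the caller names are
hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (caller, latency_ms) pairs as the dependency might record them.
calls = [("other-services", 3.0)] * 9990 + [("my-service", 2000.0)] * 10

# What a single aggregated dashboard number would show.
overall_mean = mean(latency for _, latency in calls)

# The same data broken down by caller.
by_caller = defaultdict(list)
for caller, latency in calls:
    by_caller[caller].append(latency)
per_caller_mean = {caller: mean(values) for caller, values in by_caller.items()}

print(f"overall mean: {overall_mean:.1f} ms")
for caller, value in per_caller_mean.items():
    print(f"{caller}: {value:.1f} ms")
```

Because the slow caller is a tiny share of total traffic, the aggregate
looks healthy while that caller's own latency alert keeps firing.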
        
         | willcipriano wrote:
         | In a world where people leave every 1 - 2 years to stay current
         | with the market, more often than not the failure will be
         | related to what someone else did a long time ago.
        
       | drivers99 wrote:
       | Multiple holidays ruined by being on-call (New Year's Eve more
       | than once due to year-end bugs, sitting in another room on a
       | laptop and phone while everyone has Thanksgiving), countless
       | weekends, disturbed sleep patterns. "Daring" to try to enjoy your
       | weekend like this person said. I for one am tired of it.
        
         | Psychlist wrote:
         | WTF, you're on call when on holiday? That seems awful. I'd say
         | "do teams really do that" but obviously they do. That's just
         | terrible management IMO, and a real incentive to take up hiking
         | or anything out of cellphone coverage.
        
       | 0xbadcafebee wrote:
       | Are there any books that describe building an on-call practice?
       | If not, I volunteer to write one. It's not complicated, but a lot
       | of teams miss some of the fundamentals that improve the overall
       | results. On-call's purpose is not just to wake people up in the
       | middle of the night, it's to drive continuous reduction of on-
       | call incidents. You should get to the point where you don't even
       | _need_ an on-call (but have it just in case).
        
         | johncessna wrote:
         | +1 the point of on call was to wake up the team who caused the
         | problem in the first place, and presumably have the ability to
         | fix it via code changes etc.
         | 
         | This is in contrast to the model where there's a team that
         | fields issues and tries their best to fix them, but ultimately
         | has to put fix-it tickets to the software owners.
         | 
         | The on call rotations I've seen end up being a hybrid where you
         | get the worst of both worlds. The team has enough
         | responsibilities that no one person on the team knows how to
         | handle everything, and the team is so inundated with pages
         | that it just becomes accepted that one person from the team
         | every X weeks is going to be doing nothing but keeping the
         | service from burning down.
        
       | sys_64738 wrote:
       | Being on call is awful.
        
       | debarshri wrote:
       | I was a dev turned SRE at one point in my life. Man, the anxiety
       | basically ruined my life. I could see I became a very hyper
       | person, checking my phone all the time, to the point that it
       | impacted my relationship at the time. If you are inheriting a
       | bad project or service without a lot of automated remediations or
       | processes, paired with a toxic work environment, just don't. No
       | one prepares you for the anxiety the role brings with it. In the
       | end you would realise the work and 2x-3x rate per hour is
       | probably not worth it. I have seen people getting burned out. You
       | have to have tools and certain processes in place for SRE/on-call
       | duties. Also, Slack notifications on your phone are actually the
       | worst, as your brain always thinks you got a message for some
       | reason when you keep your phone in your pocket. Just my opinion.
        
         | rr808 wrote:
         | Yeah, if it gets too full-on it fries your nerves. It also
         | ruined my concentration. When I went back to pure dev I
         | couldn't sit and work because I was always waiting for some
         | alert to go off.
        
           | debarshri wrote:
           | Meditating helped me regain my focus so far.
        
       ___________________________________________________________________
       (page generated 2022-03-14 23:00 UTC)