[HN Gopher] Show HN: I built an open-source tool to make on-call...
       ___________________________________________________________________
        
       Show HN: I built an open-source tool to make on-call suck less
        
       Hey HN,
        
       I am building an open-source platform to make on-call better and
       less stressful for engineers. We are building a tool that can
       silence alerts and help with debugging and root cause analysis.
       We also want to automate the tedious parts of being on-call
       (running runbooks manually, answering questions on Slack, dealing
       with PagerDuty). Here is a quick video of how it works:
       https://youtu.be/m_K9Dq1kZDw
        
       I hated being on-call for a couple of reasons:
        
       * Alert volume: The number of alerts kept increasing over time,
       and it was hard to maintain the existing ones. This led to a lot
       of noisy, unactionable alerts. I have lost count of the number of
       times I got woken up by an alert that auto-resolved 5 minutes
       later.
        
       * Debugging: Debugging an alert or a customer support ticket
       required me to gain context on a service I might not have worked
       on before. These companies used many observability tools, which
       made debugging challenging. There was always time pressure to
       resolve issues quickly.
        
       There were some more tangential issues that used to take up a lot
       of on-call time:
        
       * Support: Answering questions from other teams. A lot of the
       time these questions were repetitive and had been answered
       before.
        
       * Dealing with PagerDuty: These tools are hard to use, e.g. it
       was hard to schedule an override in PD or set up holiday
       schedules.
        
       I am building an on-call tool that is Slack-native, since Slack
       has become the de-facto tool for on-call engineers.
        
       We heard from a lot of engineers that maintaining good alert
       hygiene is a challenge. To start off, Opslane integrates with
       Datadog and can classify alerts as actionable or noisy. We
       analyze your alert history across various signals:
        
       1. Alert frequency
       2. How quickly the alerts have resolved in the past
       3. Alert priority
       4. Alert response history
        
       Our classification is conservative, and it can be tuned as teams
       gain more confidence in the predictions (see the rough sketch at
       the end of this post). We want to make sure that you aren't
       accidentally missing a critical alert. Additionally, we generate
       a weekly report based on all your alerts to give you a picture of
       your overall alert hygiene.
        
       What's next?
        
       1. Building more integrations (Prometheus, Splunk, Sentry,
       PagerDuty) to continue improving on-call quality of life
       2. Helping make debugging and root cause analysis easier
       3. Runbook automation
        
       We're still pretty early in development and we want to make
       on-call quality of life better. Any feedback would be much
       appreciated!
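        
       A minimal sketch of what such a scoring heuristic could look
       like. The field names, weights, and thresholds below are
       illustrative assumptions only, not the actual implementation:
        
         # Hypothetical "noisy vs. actionable" score built from the four
         # signals listed above. Everything here is illustrative only.
         from dataclasses import dataclass

         @dataclass
         class AlertHistory:
             fires_per_week: float             # 1. alert frequency
             median_minutes_to_resolve: float  # 2. how fast it resolved
             priority: int                     # 3. priority (1 = highest)
             acked_fraction: float             # 4. share of fires acted on

         def noise_score(h: AlertHistory) -> float:
             """Return a 0..1 score; higher means more likely noise."""
             score = 0.0
             if h.fires_per_week > 10:            # fires constantly
                 score += 0.3
             if h.median_minutes_to_resolve < 5:  # tends to auto-resolve
                 score += 0.3
             if h.priority >= 3:                  # low priority
                 score += 0.2
             if h.acked_fraction < 0.2:           # rarely acted on
                 score += 0.2
             return score

         def classify(h: AlertHistory, threshold: float = 0.8) -> str:
             # Conservative: only flag as noisy when several signals agree.
             return "noisy" if noise_score(h) >= threshold else "actionable"

         flappy = AlertHistory(fires_per_week=25, median_minutes_to_resolve=3,
                               priority=4, acked_fraction=0.05)
         print(classify(flappy))  # -> "noisy"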
        
       Author : aray07
       Score  : 89 points
       Date   : 2024-07-27 13:53 UTC (1 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | LunarFrost88 wrote:
       | Really cool!
        
       | tryauuum wrote:
        | Every time I see notifications in Slack / Telegram it makes me
        | depressed. Text messengers were not designed for this. If you
        | get the "something is wrong" alert, it becomes part of the
        | history; it won't re-alert you if the problem is still
        | present. And if you have more than one type of alert, it will
        | get lost in the history.
        | 
        | I guess alerts to messengers are OK as long as it's only a
        | couple of manually created ones, and there should be a
        | graphical dashboard to surface the rest of the problems.
        
         | aray07 wrote:
          | Yeah, I agree that Slack is not the best medium for alerts.
          | I think the reason it has somewhat become the default in
          | teams is that it makes it easy to collaborate while
          | debugging. I don't know a good way to substitute that and
          | share information.
         | 
         | What strategies have you seen work well?
        
         | stackskipton wrote:
         | Why? We send alerts to Slack and Pagerduty. Slack is to help
         | everyone who might be working, PagerDuty alerts the persons who
         | are actually in charge of working on it.
        
           | Aeolun wrote:
           | Yeah, I think it's convenient. We use email, but for the same
            | thing. If I inadvertently break something, I'll have an email
           | in my inbox 5 minutes later.
        
         | northrup wrote:
         | THIS. Whispering into a slack channel off hours isn't a way to
         | get on-call support help nor is dropping alerts in one. If it's
         | a critical issue I'm going to need a page of some kind. Either
         | from something like PagerDuty or directly wired up SMS
         | messaging.
        
       | lmeyerov wrote:
        | Big fan of this direction. The architecture resonates! The
        | baselining is interesting; I'm curious how you think about
        | that, esp for bootstrapping initially + ongoing.
       | 
       | We are working on a variant being used more by investigative
       | teams than IT ops - so think IR, fraud, misinfo, etc - which has
       | similarities but also domain differences. If of interest to
        | someone with an operational infosec background (hunt, IR,
        | secops), and esp US-based, the Louie.AI team is hiring an SE +
        | principal
       | here.
        
       | racka wrote:
       | Really cool!
       | 
       | Anyone know of a similar alert UI for data/business alarms (eg
       | installs dropping WoW, crashes spiking DoD, etc)?
       | 
        | Something that feeds off Snowflake/BigQuery, but with a similar
       | nice UI so that you can quickly see false positives and silence
       | them.
       | 
       | The tools I've used so far (mostly in-house built) have all ended
       | in a spammy slack channel that no one ever checks anymore.
        
       | RadiozRadioz wrote:
       | > Slack-native since that has become the de-facto tool for on-
       | call engineers.
       | 
       | In your particular organization. Slack is one of many instant
       | messaging platforms. Tightly coupling your tool to Slack instead
       | of making it platform agnostic immediately restricts where it can
       | be used.
       | 
       | Other comment threads are already discussing the broader issues
       | with using IM for this job, so I won't go into it here.
       | 
       | Regardless, well done for making something.
        
         | aray07 wrote:
         | Thanks for the feedback. We want to get something out quickly
         | and we had experience working with Slack so it made sense for
         | us to start there.
         | 
         | However, the design is pretty flexible and we don't want to tie
         | ourselves to a single platform either.
        
         | FooBarWidget wrote:
         | Try Netherlands. We're Microsoft land over here. Pretty much
         | everyone is on Azure and Teams. It's mostly startups and hip
         | small companies that use Slack.
        
           | satyamkapoor wrote:
            | Startups, hip small companies, tech-product-based
            | companies. Most non-tech companies and enterprise banks in
            | NL are on Teams.
        
             | Aeolun wrote:
             | I really feel like the world would be a better place if it
             | was illegal to bundle Teams like this...
        
       | solatic wrote:
       | In my current workplace (BigCo), we know exactly what's wrong
       | with our alert system. We get alerts that we can't shut off,
       | because they (legitimately) represent customer downtime, and
       | whose root cause we either can't identify (lack of observability
       | infrastructure) or can't fix (the fix is non-trivial and
       | management won't prioritize).
       | 
       | Running on-call well is a culture problem. You need management to
       | prioritize observability (you can't fix what you can't show as
       | being broken), then you need management to build a no-broken-
       | windows culture (feature development stops if anything is
       | broken).
       | 
       | Technical tools cannot fix culture problems!
       | 
       | edit: management not talking to engineers, or being aware of
       | problems and deciding not to prioritize fixing them, are both
       | culture problems. The way you fix culture problems, as someone
       | who is not in management, is to either turn your brain off and
       | accept that life is imperfect (i.e. fix yourself instead of the
       | root cause), or to find a different job (i.e. if the culture
       | problem is so bad that it's leading to burnout). In any event,
       | cultural problems _cannot_ be solved with technical tools.
        
         | aray07 wrote:
         | I completely agree that technical tools cannot fix culture
         | problems.
         | 
         | However, one of the things that I noticed in my previous
         | companies was that my management chain wasn't even aware that
         | the problem was this bad.
         | 
         | We also wanted to add better reporting (like the alert
         | analytics) so that people have more visibility into the state
         | of alerts + on-call load on engineers.
         | 
         | What strategies have worked well for you when it comes to
         | management prioritizing these problems?
        
           | dennis_jeeves2 wrote:
           | >However, one of the things that I noticed in my previous
           | companies was that my management chain wasn't even aware that
           | the problem was this bad.
           | 
           | Isn't that a cultural problem?
        
           | djbusby wrote:
           | Show them the costs! Wasted time, wasted resources, wasted
           | money. Show the waste and come with the plan to reduce the
           | waste. Alerts, on-calls and tests are all waste reduction.
           | 
           | "We're paying down our technical debt"
        
         | hoistbypetard wrote:
         | That's true. But technical tools can help you highlight culture
          | problems so that they're easier to discuss and fix. It's
         | been a minute since I've had to process exactly the kind of on-
         | call/alert problem we're discussing here, but this does feel
         | like the kind of tool that would help sell the kinds of
         | management/culture changes necessary to really improve things,
         | if not fix all of them.
        
           | djbusby wrote:
           | Switching tools, or adopting new (unproven) ones doesn't
           | address or fix the communication issue.
           | 
           | The existing tools mentioned can show the metrics. Management
           | needs an education - and that is part of the engineering job.
        
             | Aeolun wrote:
             | > Management needs an education - and that is part of the
             | engineering job
             | 
             | Isn't that bizarre? In all my years as an engineer I can
             | count the number of managers that went to learn about
             | engineering by themselves, on one hand.
             | 
             | It's literally their job, but somehow they feel they can do
             | it without understanding it.
        
         | cyanydeez wrote:
         | Obviously, the best way to get management's attention is to
         | start a stop and frisk customer engagement plan.
        
         | __turbobrew__ wrote:
         | I work on a team which runs hyper critical infra on all
         | production machines at BigCo and have the same experience as
         | you.
         | 
          | The problem is not the alerts -- the alerts actually are
          | catching real problems -- the problem is the following:
          | 
          | 1. The team is understaffed, so sometimes spending a few
          | days root-causing an alert is not prioritized.
          | 
          | 2. When alerts are root-caused, sometimes the work to fix
          | the root cause is not prioritized.
          | 
          | 3. A culture on the team which allows alerts to go untriaged
          | due to desensitization.
         | 
         | Our headcount got reduced by ~40% and -- surprise surprise --
         | reliability and on-call got much worse. Senior leadership has
         | made the decision that the cost cuts are worth the decreased
         | reliability so nothing is going to change.
         | 
         | The job market is rough so people put up with this for now.
        
         | whazor wrote:
          | Or maybe page your managers, so that they can escalate the
          | situation. They will be more aligned on solving the cultural
          | problems if they get woken up too.
        
           | aray07 wrote:
            | Yeah, the best managers I worked with used to be on the same
           | on-call rotation such that they would also get paged every
           | time. That helped build empathy and visibility into the
           | situation.
        
           | blitzar wrote:
           | Or maybe page your managers, such that they can fire you
        
             | jobtemp wrote:
             | Then... problem solved!
        
         | efxhoy wrote:
         | > Running on-call well is a culture problem. You need
         | management to prioritize observability (you can't fix what you
         | can't show as being broken), then you need management to build
         | a no-broken-windows culture (feature development stops if
         | anything is broken).
         | 
         | I was lucky enough to join a company where management does
         | this. The managers were made to do this by experienced
         | engineers who explained to them in no uncertain terms that
         | stuff was broken and nothing was being shipped until things
         | stopped being broken. Unless you have good managers this won't
         | happen without a fight and it's a fight I think we as engineers
         | need to take.
         | 
         | Some managers in other teams played the "oh it's not super high
         | impact it's not prioritized" game, and those teams now own a
         | bunch of broken stuff and make very slow progress because their
         | developers are tiptoeing around broken glass, and end up
         | building even more broken stuff because nothing they own is
         | robust. Those managers played themselves.
         | 
         | Communication with management is bidirectional, sometimes they
         | need a lot of persuasion.
        
       | maximinus_thrax wrote:
       | Nice work, I always appreciate the contribution to the OSS
       | ecosystem.
       | 
        | That said, I like that you're 'saying this out loud'. Slack
        | and other similar comm tooling has always been advertised as a
        | productivity booster due to its 'async' nature. Nobody
        | actually believes this anymore, and coupling it with on-call
        | notifications really closes the lid on that thing.
        
         | aray07 wrote:
         | Yeah, unfortunately, I don't think these messaging tools are
         | async. During oncall, I used to pretty much live on Slack.
         | Incidents were on slack, customer tickets on slack, debugging
         | on slack...
        
           | maximinus_thrax wrote:
           | That is correct, they are not. My former workplace had
           | Pagerduty integrated with Slack, so I get it...
        
       | lars_francke wrote:
        | A shameless question, tangentially related to the topic.
       | 
       | We are based in Europe and have the problem that some of us
       | sometimes just forget we're on call or are afraid that we'll miss
       | OpsGenie notifications.
       | 
        | We're desperately looking for a hardware solution. I'd like
       | something similar to the pagers of the past but at least here in
       | Germany they don't really seem to exist anymore. Ideally I'd have
       | a Bluetooth dongle that alerts me on configurable notifications
       | on my phone. Carrying this dongle for the week would be a
       | physical reminder I'm on call.
       | 
       | Does anyone know anything?
        
         | michaelt wrote:
         | A candy bar cell phone, paid for by your employer and handed to
         | whoever is on call. People who don't want it can just forward
         | it to their phone.
        
           | crawfishphase wrote:
            | In this case, a satellite-enabled candybar. The disaster
            | recovery policy and budget should be applicable here. Make
            | sure it's able to share xG and a satellite tunnel for
            | maximum value, and ensure the reporting system is
            | satellite-enabled also. Added points if it's sending
            | alerts to your handy BYOD. Disaster recovery is a big deal
            | in 2024. All sorts of factors make satellite redundancy
            | valuable in today's reality: coworkers on a hike or a
            | boat, random 0-day stuff, and war can cut your normal
            | internet. I have experienced all of these, only in the
            | last 4 years, and more than once on each topic. Train your
            | users to destroy it in case of war, as it's trackable by
            | military tech. Put a sticker on it. Check out
            | StackExchange for questions like this, though.
        
         | jobtemp wrote:
         | There are phone apps that can pierce through all silent or DND
          | settings. Get one of those. If the same app could buzz at
          | less than 50% battery to remind you to charge, that would
          | help. The same app could also ask you to confirm your
          | on-call status so people don't forget. If they don't
          | confirm, someone else gets the shift.
        
       | dclowd9901 wrote:
       | > It reduces alert fatigue by classifying alerts as actionable or
       | noisy and providing contextual information for handling alerts.
       | 
       |  _grimace face_
       | 
       | I might be missing context here, but this kind of problem speaks
       | more to a company's inability to create useful observability, or
       | worse, their lack of conviction around solving noisy alerts
       | (which upon investigation might not even be "just" noise)! Your
       | product is welcome and we can certainly use more competition in
       | this space, but this aspect of it is basically enabling bad
       | cultural practices and I wouldn't highlight it as a main selling
       | point.
        
         | aray07 wrote:
          | Yeah, that's fair feedback. The main aim was to reduce the alert
         | fatigue for on-call engineers and provide a way to get insight
         | into the alerts at the end of the on-call shift.
         | 
         | This way there is data to make a case that certain alerts are
         | noisy (for various reasons) and we should strive to reduce the
          | time spent dealing with these alerts. Fixing some of them
          | might be as easy as deleting them, but others might need
          | dedicated time working on them.
        
       | theodpHN wrote:
       | What you've come up with looks helpful (and may have other
       | applications as someone else noted), but you know what also makes
       | on-call suck less? Getting paid for it, in $ and/or generous comp
       | time. :-)
       | 
       | https://betterstack.com/community/guides/incident-management...
       | 
       | Also helpful is having management that is responsive to bad on-
       | call situations and recognizes when capable, full-time around-
       | the-clock staffing is really needed. It seems too few well-paid
       | tech VPs understand what a 7-Eleven management trainee does,
       | i.e., you shouldn't rely on 1st shift workers to handle all the
       | problems that pop up on 2nd and 3rd shift!
        
         | Aeolun wrote:
          | I guess 7-Eleven management trainees know that their company
          | is just as replaceable to their employees as their employees
          | are to them.
        
       | deepfriedbits wrote:
       | Nice job and congratulations on building this! It looks like your
       | copy is missing a word in the first paragraph:
       | 
       | > Opslane is a tool that helps (make) the on-call experience less
       | stressful.
        
         | aray07 wrote:
          | Derp, thanks for catching that. It has been fixed!
        
       | sanj001 wrote:
       | Using LLMs to classify noisy alerts is a really clever approach
        | to tackling alert fatigue! Are you fine-tuning your own model to
       | differentiate between actionable and noisy alerts?
       | 
       | I'm also working on an open source incident management platform
       | called Incidental (https://github.com/incidentalhq/incidental),
       | slightly orthogonal to what you're doing, and it's great to see
       | others addressing these on-call challenges.
       | 
       | Our tech stacks are quite similar too - I'm also using Python 3,
       | FastAPI!
        
         | jobtemp wrote:
          | Why not use statistics? I've been reading about XmR charts
          | recently on Commoncog. That might help, for example.
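          | 
          | For reference, a rough sketch of the XmR (individuals and
          | moving range) arithmetic in Python; purely illustrative and
          | not tied to any particular alerting tool:
          | 
          |   def xmr_limits(values: list[float]) -> tuple[float, float, float]:
          |       """Return (lower_limit, mean, upper_limit) for an XmR chart."""
          |       mean = sum(values) / len(values)
          |       moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
          |       avg_mr = sum(moving_ranges) / len(moving_ranges)
          |       # 2.66 is the standard constant for individuals charts.
          |       return mean - 2.66 * avg_mr, mean, mean + 2.66 * avg_mr
          | 
          |   history = [12, 14, 11, 13, 15, 12, 14, 13]
          |   low, mean, high = xmr_limits(history)
          |   # A new point outside [low, high] is a real signal, not noise.
          |   print(f"alert only outside [{low:.1f}, {high:.1f}]")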
        
         | aray07 wrote:
         | Thanks for the feedback! I saw the incidental launch on HN and
         | have been following your journey!
        
       | EGreg wrote:
       | One of the "no-bullshit" positions I have arrived at over the
       | years is that "real-time is a gimmick".
       | 
       | You don't need that Times Square ad, only 8-10 people will look
        | up. If you just want the footage of your conspicuous
        | consumption, you have been able to easily photoshop it for
        | decades already.
       | 
       | Similarly, chat causes anxiety and lack of productivity. Threaded
       | forums like HN are better. Having a system to prevent problems
       | and the rare emergency is better than having everyone glued to
       | their phones 24/7. And frankly, threads keep information better
       | localized AND give people a chance to THINK about the response
       | and iterate before posting in a hurry. When producers of content
       | take their time, this creates efficiencies for EVERY INTERACTION
       | WITH that content later, and effects downstream. (eg my caps lock
       | gaffe above, I wont go back and fix it, will jjst keesp typing
       | 111!1!!!)
       | 
       | Anyway people, so now we come to today's culture. Growing up I
       | had people call and wish happy birthday. Then they posted it on
       | FB. Then FB automated the wishes so you just press a button. Then
       | people automated the thanks by pressing likes. And you can
       | probably make a bot to automate that. What once was a thoughtful
       | gesture has become commoditized with bots talking to bots.
       | 
       | Similar things occurred with resumes and job applications etc.
       | 
       | So I say, you want to know my feedback? Add an AI agent that
       | replies back with basic assurances and questions to whoever
       | "summoned you", have the AI fill out a form, and send you that.
       | The equivalent of front-line call center workers asking "Have you
       | tried turning it on and off again" and "I understand it doesn't
       | work, but how can we replicate it."
       | 
        | That repetitive stuff should be done by AI, which can build up
        | an FAQ knowledge base for bozos and then only bother you if it
        | comes across a novel problem it hasn't solved yet, like an
        | emergency because, say, there's a Windows BSOD spreading and
        | systems don't boot up. Make the AI do triage and tell the
        | difference.
        
       | snihalani wrote:
       | can you build a cheaper datadog instead?
        
       | protocolture wrote:
       | I feel like this would be a great tool for people who have had a
       | much better experience of On Call than I have had.
       | 
       | I once worked for a string of businesses that would just send
       | _everything_ to on call unless engineers threatened to quit.
        | Promised automated late night customer sign ups? Haven't
       | actually invested in the website so that it can do that? Just
       | make the on call engineer do it. Too lazy to hire off shore L1
       | technical support? Just send residential internet support calls
        | to the On Call engineer! Sell a service that doesn't work in
       | the rain? Just send the on call guy to site every time it rains
       | so he can reconfirm yes, the service sucks. Basic usability
       | questions that could have been resolved during business hours?
        | Does your contract say 24/7 support? Damn, guess that's going to
       | On Call.
       | 
        | Shit, even in contracting gigs where I have agreed to be "On Call"
       | for severity 1 emergencies, small business owners will send you
       | things like service turn ups or slow speed issues.
        
       ___________________________________________________________________
       (page generated 2024-07-28 23:01 UTC)