[HN Gopher] Grafana OnCall: an easy-to-use on-call management tool
___________________________________________________________________
Grafana OnCall: an easy-to-use on-call management tool
Author : sciurus
Score : 161 points
Date : 2021-11-09 17:16 UTC (5 hours ago)
(HTM) web link (grafana.com)
(TXT) w3m dump (grafana.com)
| marcoboffi wrote:
| Is it possible to send SMS/phone calls directly from Grafana
| OnCall? If yes, is there pricing for it?
| markbnj wrote:
| I'm a grafana fan and a current user of PagerDuty. Maybe there's
| more to the story but after reading the post I feel like using a
| calendar integration to manage on-call schedules is the wrong
| approach. Calendar events are a result of overlaying a rotation
| on a date range: they're the output, not the input. I'm sure the
| designers here have looked at how PD enables creating and editing
| rotations. Curious to know their views on it.
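The "rotation is the input, calendar events are the output" point above can be sketched in a few lines. This is only an illustration under assumed names (`expand_rotation`, the people, and the dates are all hypothetical), not how PagerDuty or Grafana OnCall model schedules.

```python
from datetime import date, timedelta

def expand_rotation(people, start, end, shift_days=7):
    """Overlay a rotation on the date range [start, end) and return the shifts.

    Each shift is a (shift_start, shift_end, person) tuple; these tuples are
    what would end up as calendar events.
    """
    shifts = []
    cursor = start
    index = 0
    while cursor < end:
        shift_end = min(cursor + timedelta(days=shift_days), end)
        shifts.append((cursor, shift_end, people[index % len(people)]))
        cursor = shift_end
        index += 1
    return shifts

# Hypothetical three-person weekly rotation over four weeks.
shifts = expand_rotation(["alice", "bob", "carol"], date(2021, 11, 1), date(2021, 11, 29))
for shift_start, shift_end, person in shifts:
    print(shift_start, shift_end, person)
```

Editing the rotation (the list of people or the shift length) regenerates every event, which is harder to do when the calendar entries themselves are the source of truth.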
| motakuk wrote:
| Hey everyone, Matvey, ex-CEO of Amixr is here. Me and Ildar
| Iskhakov started this project three years ago because we used to
| be on-call ourselves and needed better tools. It was an amazing
| journey from 0 to 1. Tons of coding, first customers,
| fundraising, iterating, and finally the honor to join Grafana
| Labs and build Grafana OnCall! I'll be happy to answer your
| questions if you have any.
| joaoqalves wrote:
| It's great to see more competition in this space. Generally
| speaking, what I miss in these "incident management" products
| is also an integrated, flawless way to handle incidents _when_
| they're happening. I'm talking about:
|
| 1. Quickly creating a proper chat
| 2. Quickly creating an incident document where you can pin chat
| messages and use it in the post-mortem. Ideally, pinning some
| graphs that you'd extract from your observability solutions
| 3. Having a status page to put a small description for
| non-technical stakeholders.
|
| PagerDuty covers some of this. Monzo's Response [1] and now
| incident.io [2] try to cover it too. I'd like to have this
| experience end-to-end.
|
| 1 - https://github.com/monzo/response
| 2 - https://incident.io/
| igetspam wrote:
| I use incident.io. Pretty happy with it. Very responsive
| team.
| hrpnk wrote:
| Monzo's solution does not seem to be actively maintained, is
| it?
|
| +100 on the creation of incident chat rooms and pinning data
| to re-use in incident docs. There is nothing worse than
| copying the timeline events from one tool to a Google Doc.
| joaoqalves wrote:
| AFAIK, the creators created incident.io as a spin-off [1]
| :) Smart move, I must say.
|
| 1 - https://www.indexventures.com/perspectives/incidentio-
| raises...
| SeriousM wrote:
| Hi! Thanks for sharing this news. Will this be available for
| on-premise installations, and when?
| motakuk wrote:
| For now, we are focusing on rolling out Grafana OnCall in
| Grafana Cloud. It's a very common use case to have such a
| system outside of your infrastructure so it won't be affected
| by issues there. It should be alive even when everything
| goes wrong.
|
| We've already received multiple questions about OSS and
| on-premises. We'll roll out the cloud version first, see how it
| works, collect feedback and build (and share) future plans!
| bilalq wrote:
| This looks really neat. We don't use Grafana today. We're
| running CloudWatch/insights and Squadcast for alerting, but
| deep integration with the monitoring tool looks cool. Is this
| usable with self-hosted or AWS managed Grafana?
| motakuk wrote:
| Yep! The idea of Grafana OnCall is to help you to group,
| deduplicate, route & deliver to Slack/SMS/Phone alerts from
| any sources. It could be a CloudWatch, DataDog, self-hosted
| Alertmanager, or Grafana of course. The only requirement for
| the alert source is to be able to generate a webhook and send
| it to us.
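As a rough sketch of the "generate a webhook and send it" requirement above: the snippet below builds a JSON POST request with Python's standard library. The URL and the payload field names (`title`, `message`, `state`) are assumptions made for illustration; the thread does not specify the payload format Grafana OnCall expects.

```python
import json
import urllib.request

def build_alert_request(webhook_url, title, message, state="alerting"):
    """Build (but do not send) a JSON POST request describing one alert."""
    # Field names are hypothetical; check the integration's docs for the real schema.
    payload = {"title": title, "message": message, "state": state}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder URL; a real integration would supply its own webhook address.
req = build_alert_request(
    "https://oncall.example.invalid/integrations/webhook/TOKEN",
    "Disk almost full",
    "/var is at 95% on db-1",
)
# Sending would be: urllib.request.urlopen(req, timeout=5)
```

Any monitoring tool that can issue an HTTP POST like this could in principle act as an alert source.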
| bilalq wrote:
| Can Grafana OnCall itself be self-hosted and/or run as a
| part of Grafana itself? Your last response makes it sound
| like it's a separate product with integrations rather than
| an extension of Grafana. Is that correct?
| motakuk wrote:
| It's 100% part of the Grafana Cloud, not a separate
| product. It's deeply integrated with the rest of Grafana.
|
| At the same time, we've focused on making it useful for those
| who don't use Grafana for monitoring. Feel free to sign up for
| Grafana Cloud and use just OnCall if you want.
| halfmatthalfcat wrote:
| Is there really anybody else in the "Pager" category of SaaS
| products other than PagerDuty that have any traction?
| aiisjustanif wrote:
| xMatters
| bboreham wrote:
| Opsgenie?
| therealdrag0 wrote:
| We use OpsGenie. Not sure how widely it's used, but given it's
| Atlassian I'd guess a non-trivial amount.
| bgm1975 wrote:
| There's Splunk OnCall (formerly known as VictorOps). It's a
| very decent solution.
| bilalq wrote:
| We started using Squadcast: https://squadcast.com
|
| Their free and lower-priced tiers offer a lot of what others
| have on their top/most expensive tiers. Also, integrations with
| various alert sources are just easier in most cases. I spent I
| don't know how long trying to get OpsGenie to work before I
| gave up.
| fredman wrote:
| There is xMatters: https://www.xmatters.com/
|
| Disclaimer: I work at xMatters.
| Forfold wrote:
| I work on/for an open source solution that we based off of
| PagerDuty, called GoAlert: https://github.com/target/goalert
| awestman wrote:
| Yep. This is a great product. Has the features you need, is
| super reliable and easy to manage.
| craigching wrote:
| Target uses GoAlert across the enterprise for all on-call.
| Definitely enterprise capable!
| abhishekjha wrote:
| Also what happens if pagerduty goes down?
| jq-r wrote:
| Your service(s) and PagerDuty both going fully down at the
| same time is very unlikely. Even if it happens, you're probably
| going to get called by customer support, because users never
| go down ;)
| kevindong wrote:
| In the year I used it, I never personally noticed it going
| down. That said, their SLA is only 99.9% delivery within 5
| minutes in any calendar month. The penalty for missing that
| SLA is only 10% of that month's bill.
|
| > Once an Incident is triggered, PagerDuty will deliver the
| First Responder Alert within the Notification Delivery Period
| for 99.9% of the notifications sent by PagerDuty for the
| Customer during any calendar month. The "Notification
| Delivery Period" is five (5) minutes and it is measured as
| the time it takes PagerDuty to deliver a First Responder
| Alert to telecommunication providers in accordance with the
| Service configuration and Contact Information.
|
| > ...
|
| > If PagerDuty fails to meet the SLA set forth herein,
| Customer may receive a service credit. Customer will be
| eligible for a credit toward future fees owed to PagerDuty
| for the PagerDuty Service. The Service Credit is calculated
| as ten percent (10%) of the fees paid for or attributable to
| the month when the alleged SLA breach occurred.
|
| https://www.pagerduty.com/standard-service-level-agreement/
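To make the quoted numbers concrete, here is a small arithmetic sketch. It only encodes the two figures quoted above (99.9% of notifications within the delivery period, a 10% credit); the function name, notification counts, and monthly fee are made up, and the real SLA says the customer "may" receive the credit, so treat this as an upper-bound illustration.

```python
def sla_summary(notifications, late, monthly_fee_cents):
    """Return (sla_met, credit_cents) for one calendar month.

    99.9% on time means at most 1 in 1000 notifications may be late.
    Integer math is used to avoid floating-point rounding.
    """
    sla_met = late * 1000 <= notifications
    credit_cents = 0 if sla_met else monthly_fee_cents * 10 // 100
    return sla_met, credit_cents

# Hypothetical month: 10,000 notifications on a $500.00 plan.
within_sla = sla_summary(10_000, 10, 50_000)   # 10 late is exactly 0.1%
breached = sla_summary(10_000, 11, 50_000)     # 11 late exceeds 0.1%
print(within_sla, breached)
```

With these made-up numbers, a breach is worth at most $50 of credit, which is the point the comment above is making about the size of the penalty.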
| vorpalhex wrote:
| It's very rare for them to go down. I think I can remember
| one major outage during business hours in the last few years
| at which point we just switched to manual monitoring for the
| few hours.
|
| If that is within your outage model, you'd probably want a
| redundant on-call service I suppose, even if it's just
| escalating to a single known email or sms group.
| julianlam wrote:
| Ideally, the services you use should handle that (detect a
| non-200 and fire off a backup method like a slack webhook or
| email.)
|
| In reality, probably a lot of missed downtime events, and ops
| sleeping peacefully I guess.
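The fallback idea in this comment (detect a non-200 and fire off a backup method) might look roughly like the sketch below. The sender functions are stand-ins that return an HTTP-style status code; `notify_with_fallback` and the channel names are hypothetical, not any vendor's API.

```python
def notify_with_fallback(alert, channels):
    """Try each (name, send) channel in order.

    Returns the name of the first channel whose send() returns 200,
    or None if every channel fails or raises.
    """
    for name, send in channels:
        try:
            status = send(alert)
        except Exception:
            # A network error is treated the same as a non-200 response.
            continue
        if status == 200:
            return name
    return None

# Stub senders standing in for real HTTP calls.
def paging_service_down(alert):
    return 503

def slack_webhook_ok(alert):
    return 200

delivered_via = notify_with_fallback(
    {"title": "API latency high"},
    [("pager", paging_service_down), ("slack", slack_webhook_ok)],
)
print(delivered_via)
```

A `None` result means no channel accepted the alert, which is the case that otherwise turns into the silent "missed downtime" the comment describes.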
| armiiller wrote:
| PagerTree - https://pagertree.com
| coderchix wrote:
| My team uses PagerTree. Easy to get started with, has the
| tools you need without being overcomplicated.
| saminzadeh wrote:
| DataDog also launched their own Incident Management tool, not
| sure how widely it's used:
| https://www.datadoghq.com/blog/incident-response-with-datado...
| haliskerbas wrote:
| Technically Splunk On-call. But I have a few pain points with
| it, and I miss pagerduty.
|
| If you want to see what teams you are on as the currently
| logged-in user, the only way to do it, as far as support told
| me, is to search for yourself and then check that result.
| rconti wrote:
| I see my teams listed under my user profile. Or if I go to
| the left side bar and click on my name, it says when I'm next
| on-call for various teams. But the UI looks different than
| last time I logged in a few weeks ago, so maybe something has
| changed.
|
| Disclaimer: Am an employee.
| dvtrn wrote:
| I've been seeing them recommended more and more, and myself
| have been keeping a passive eye on BetterUptime (which has an
| on-call feature): https://betteruptime.com/incident-management
| moepstar wrote:
| A few more screenshots of the "Scheduling" options would've been
| great...
|
| We're (more or less) using OpsGenie's free tier, however their
| scheduling never really "clicked" with me... not sure if I'm
| special in that regard, however I find the UI/UX pretty...
| weird...
| CSDude wrote:
| > Alerts from each integration 300 5 minutes
|
| > Alerts from the whole team 500 5 minutes
|
| > API requests per API key 300 5 minutes
|
| Product looks great but those API request limits are too low,
| because alerts rain when you are having incidents and rate
| limiting all of them is harmful. That's why other products have
| deduplication keys / aliases so you don't miss important ones.
|
| https://grafana.com/docs/grafana-cloud/oncall/oncall-api-ref...
| named-user wrote:
| How else do you think they are gonna make money?
| CameronNemo wrote:
| _That's why other products have deduplication keys / aliases
| so you don't miss important ones._
|
| Care to link to the docs? I'm interested.
| CSDude wrote:
| https://support.atlassian.com/opsgenie/docs/what-is-alert-
| de...
|
| https://support.pagerduty.com/docs/event-management
| CameronNemo wrote:
| Thanks for the links.
|
| From the article:
|
| _With Grafana OnCall's automatic grouping of alerts within
| Slack, you can avoid alert storms and reduce the noise your
| teams are exposed to during an incident._
|
| Seems like the same feature described using different
| terminology.
| EwanToo wrote:
| The output alerts feature looks largely the same, but the
| input API limits are the part in question.
|
| What happens if you get 1000 API calls about "Alert 1"
| and 1 API call about "Alert 2".
|
| You want both alerts to trigger the on-call once, but will
| alert 2 get through?
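The question above can be made concrete with a toy model. The sketch compares applying a limit to raw requests with collapsing requests by a deduplication key first. The limit of 300 echoes the figure quoted earlier in the thread, but the function names and the ordering of dedup versus rate limiting are assumptions; the thread does not say which order Grafana OnCall uses.

```python
def admit_naive(alerts, limit):
    """Rate-limit raw requests: keep the first `limit`, drop the rest."""
    return {key for key, _payload in alerts[:limit]}

def admit_deduplicated(alerts, limit):
    """Collapse requests by dedup key first, then limit distinct keys."""
    counts = {}
    for key, _payload in alerts:
        counts[key] = counts.get(key, 0) + 1
    kept = list(counts)[:limit]  # dicts preserve first-seen order
    return {key: counts[key] for key in kept}

# 1000 calls about "alert-1" followed by a single call about "alert-2".
burst = [("alert-1", {"host": "db-1"})] * 1000 + [("alert-2", {"host": "web-3"})]
naive_keys = admit_naive(burst, 300)
dedup_counts = admit_deduplicated(burst, 300)
print(naive_keys)
print(dedup_counts)
```

In this model the naive limiter never sees "alert-2", while the deduplicating one admits both keys and records how many requests each absorbed.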
| deeblering4 wrote:
| I'd think that receiving even 1/5th the rate limit in a 5
| minute window would be disorienting enough to render alerting
| effectively useless.
|
| I'd question the configuration which fires that many alerts in
| that time frame, and suggest improving alert aggregations and
| dependencies to get the number down to one or a handful of
| meaningful alerts.
| curryst wrote:
| The overhead of maintaining those configurations all the time
| is usually too high to be worth it considering the benefit
| and likelihood of reaping it.
|
| Also, in my experience with those systems, they only make
| sense to use very sparingly. Your monitoring becomes
| extremely fragile when your aggregations and dependencies get
| complicated enough that "what will our alerting system do
| when X happens?" results in a flow chart with 18 steps.
|
| If you aren't careful, you can end up making your
| aggregations less useful than the raw alerts would be.
| rmetzler wrote:
| It would be great to have a dependency graph or labels in the
| alerts, so they are easily mapped to the things that can
| break and are important enough to be monitored.
|
| We just had a short outage where an editor removed the index
| page in the cms which is central to the site. It's stupid
| that this is possible but we just operate the cms while we
| build and operate everything around it for our customer.
|
| I think a large part of our alerts were triggered all at
| once but the one thing they had in common was that the alerts
| all pointed to the index page in the cms. E.g. the public www
| alert for index, the public api alert for index, the preview
| www alert for index, the preview api alert for index....
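The labeling idea in this comment could be sketched as grouping alerts by a shared label. The alert names below mirror the four mentioned above; the `depends_on` label and its `cms:index` value are hypothetical names chosen for the example.

```python
from collections import defaultdict

def group_by_label(alerts, label):
    """Group alert names by the value of one label; unlabeled alerts share a bucket."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["labels"].get(label, "unlabeled")].append(alert["name"])
    return dict(groups)

# Four alerts that all trace back to the same missing CMS index page.
alerts = [
    {"name": "public www index", "labels": {"depends_on": "cms:index"}},
    {"name": "public api index", "labels": {"depends_on": "cms:index"}},
    {"name": "preview www index", "labels": {"depends_on": "cms:index"}},
    {"name": "preview api index", "labels": {"depends_on": "cms:index"}},
]
grouped = group_by_label(alerts, "depends_on")
print(grouped)
```

Four simultaneous pages collapse into one group keyed by the thing that actually broke, which is the mapping the comment is asking for.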
| steveBK123 wrote:
| For a product that's been around 12 years, I've been surprised at
| how minimally featured PagerDuty is.
|
| Stuff like national holiday awareness, integration to vacation
| calendars, a better UI for swapping days/overrides, etc.
|
| PD schedule checking and trade negotiation becomes yet another
| thing in the long list of things I need to do when taking a day
| off. HR system request off, Department Outlook calendar update,
| PagerDuty coverage check, Outlook out-of-office status & auto-
| replies, Slack set away, update status AND pause notifications.
|
| I suppose that's because as an on-call developer I am not the
| user. The user, management who bought the product, gets KPIs &
| pretty graphs, so they are happy.
| ethbr0 wrote:
| Every delightful, successful developer product is eventually
| doomed to become JIRA.
| rvnx wrote:
| A multi-billion USD success story?
| ethbr0 wrote:
| That's one way to look at it.
___________________________________________________________________
(page generated 2021-11-09 23:00 UTC)