[HN Gopher] Good and Bad Monitoring
___________________________________________________________________
Good and Bad Monitoring
Author : kiyanwang
Score : 69 points
Date : 2021-07-11 11:14 UTC (2 days ago)
(HTM) web link (raynorelyp.medium.com)
(TXT) w3m dump (raynorelyp.medium.com)
| devchix wrote:
| If it's like pulling teeth to motivate your users to use
| dashboards, you're building useless dashboards.
|
| Dashboards are great for at-a-glance metrics roll-ups. Build
| small, single-page, targeted dashboards that answer the questions
| your users ask, and they'll be used. I want to see the lay of the
| land
| and know I'm heading into the weeds, instead of being kicked in
| the shin by a monitoring alert when I'm already in the weeds.
| retzkek wrote:
| To elaborate on this, dashboards are where you go when you get
| an alert, to answer questions like:
|
| - how widespread is it?
|
| - who's impacted?
|
| - is the alert the root cause, or just a symptom?
|
| Dashboards should tell a story, not just be a bunch of graphs
| squeezed onto a page. There should be links to drill-down to
| more detailed dashboards, logs, and traces, to make it as fast
| and easy as possible to find the fire when you smell smoke,
| even for someone who's on their first week.
|
| Most dashboards, unfortunately, _are_ useless. But then those
| have a place too: hanging on a wall somewhere, to show people
| how not-useless we are.
| Zealotux wrote:
| >HTTP 400 level errors on their own do not indicate a problem.
|
| ...on the back-end, but they can help find issues with the front-
| end. For example: I have an endpoint my front-end calls in the
| background to update a preview for the user. Some changes in the
| front-end made all of those calls invalid (but only in
| production), and monitoring 400 errors helped me figure out the
| issue very quickly, before a customer even complained about it.
| Sometimes "the client screwed up" because of you: if 90% of your
| clients are messing up in the exact same way, it may actually be
| your fault.
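|
| A minimal sketch of that kind of check (plain Python; names and
| thresholds are illustrative): flag any endpoint whose 400 rate
| jumps far above its usual baseline.
|
|     from collections import Counter
|
|     def elevated_400s(requests, baseline_rate=0.02, min_requests=50):
|         """Flag endpoints whose 400 rate is far above a normal baseline.
|
|         `requests` is an iterable of (endpoint, status_code) tuples,
|         e.g. pulled from access logs for the last few minutes.
|         """
|         totals, bad = Counter(), Counter()
|         for endpoint, status in requests:
|             totals[endpoint] += 1
|             if status == 400:
|                 bad[endpoint] += 1
|         return {
|             ep: bad[ep] / totals[ep]
|             for ep in totals
|             if totals[ep] >= min_requests
|             and bad[ep] / totals[ep] > 10 * baseline_rate
|         }
|
|     # Example: every call to /preview suddenly fails validation.
|     sample = [("/preview", 400)] * 90 + [("/checkout", 200)] * 60
|     print(elevated_400s(sample))  # {'/preview': 1.0}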
| sokoloff wrote:
| > Eventually, you will add service F and no one will remember to
| go into the monitoring service and add it, but they will see the
| tags on the other lambdas and tag the new one correctly.
|
| This matches neither my intuition nor my experience, unless
| there's automation to check or enforce it.
| theginger wrote:
| Inflated error rates due to invalid 500 errors are definitely a
| thing; I've seen it at my last 2 jobs. I think it comes from a
| lack of confidence among developers when it comes to setting the
| HTTP status code to something different. It starts with a genuine
| 500 error caused by an unexpected invalid request the app can't
| handle, so it crashes or misbehaves in some way. They put in some
| exception handling so the app can now handle it; they now have
| code that fixes the bug that was impacting the server, but they
| still have to return something for the client's request, which is
| still not valid. A 400 would almost always be appropriate, or
| perhaps another more specific 4xx, but 500 is what was already
| being returned, and anything else is a change which might impact
| the client in an unexpected way. It takes a lot of confidence to
| make a potentially breaking change that goes beyond merely fixing
| a bug, even when you are reasonably certain it's the right
| change. Once your code base returns 500 after a caught exception
| in a few places, it starts to set a precedent that others will
| follow.
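|
| Roughly what the "confident" fix looks like, sketched here with
| Flask purely for illustration (any framework works): catch the
| validation failure and map it to a 400 instead of letting it
| surface as a generic 500.
|
|     from flask import Flask, jsonify, request
|
|     app = Flask(__name__)
|
|     class InvalidRequest(ValueError):
|         pass
|
|     @app.errorhandler(InvalidRequest)
|     def handle_invalid(err):
|         # The request was bad, not the server; say so with a 4xx.
|         return jsonify(error=str(err)), 400
|
|     @app.post("/orders")
|     def create_order():
|         body = request.get_json(silent=True) or {}
|         if "item_id" not in body:
|             # Before the fix: this raised and surfaced as a 500.
|             raise InvalidRequest("item_id is required")
|         return jsonify(status="created"), 201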
| lxe wrote:
| HTTP status codes are like... suggestions at this point.
| indigodaddy wrote:
| Yes, much prefer a workflow where alerts that actually need to be
| looked at just come into an alert-specific Slack channel with
| some pretty decent basic info. We did it this way at my last job
| with Datadog/Slack hooks. It was easy to set up and worked great.
| Staring at dashboards or even checking them every hour or
| whatever makes little sense.
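|
| The plumbing for that kind of channel is small; a rough sketch
| using a Slack incoming webhook (the URL and field values are
| placeholders):
|
|     import requests
|
|     SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
|
|     def notify(alert_name, summary, runbook_url):
|         """Post basic alert context to an alerts-only Slack channel."""
|         text = (f":rotating_light: *{alert_name}*\n"
|                 f"{summary}\nRunbook: {runbook_url}")
|         requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=5)
|
|     notify("checkout-5xx-spike",
|            "5xx rate on /checkout is 4.2% over the last 5 minutes",
|            "https://wiki.example.com/runbooks/checkout")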
| lxe wrote:
| Elevated 413 or 431 errors could mean that a bug was introduced
| to the client/frontend that sends large payloads or cookies.
|
| Elevated 400, 401, or 403 errors could mean that a bug was
| introduced in session or cookie handling middleware, client, or
| server code.
|
| Elevated 200s could mean a DDoS attack or issues with client-side
| polling.
|
| Etc..
|
| Alert on status code anomalies, not on the volume or percentage
| of a certain status code.
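|
| One bare-bones way to read "anomaly, not volume" (plain Python;
| thresholds are illustrative): compare the current window's count
| for a status class against its own recent history.
|
|     from statistics import mean, stdev
|
|     def is_anomalous(history, current, min_sigma=4, min_delta=20):
|         """True if `current` is far outside recent per-window counts.
|
|         `history` is a list of counts for one status class
|         (e.g. 401s per minute); `current` is the latest count.
|         """
|         if len(history) < 10:
|             return False  # not enough history to judge
|         mu, sigma = mean(history), stdev(history)
|         return current > mu + max(min_sigma * sigma, min_delta)
|
|     # 401s normally hover around 5/min; suddenly 300/min post-deploy.
|     print(is_anomalous([4, 6, 5, 7, 3, 5, 6, 4, 5, 6], 300))  # True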
| Crash0v3rid3 wrote:
| > I started with a system where another team would publish data
| to us in an eventing architecture and would frequently publish
| corrupt data. It was my team's responsibility to address anytime
| data was not ingested correctly into our system. As a result, we
| had floods of errors in our system. We tried asking them to stop
| and they said no.
|
| In this particular instance, I would simply respond to the caller
| with an appropriate error code and be on my way. The other team
| should be responsible for dealing with such an issue. The writer
| implies they had no choice, but I don't buy it: if you design the
| system in such a way that it doesn't accept corrupt data to begin
| with, it becomes the caller's responsibility to handle these
| issues.
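|
| Something like this at the ingestion boundary (an illustrative
| sketch; the field names are made up): reject the corrupt event
| with a client-error response instead of logging it as our own
| failure.
|
|     REQUIRED_FIELDS = ("event_id", "account_id", "payload")
|
|     def ingest(event: dict):
|         """Validate before ingesting; bad data is the publisher's problem."""
|         missing = [f for f in REQUIRED_FIELDS if f not in event]
|         if missing:
|             # A 4xx-style rejection back to the publisher, not a
|             # 5xx-style error in our own logs.
|             return {"accepted": False, "code": 400,
|                     "reason": "missing fields: " + ", ".join(missing)}
|         store(event)
|         return {"accepted": True, "code": 202}
|
|     def store(event):
|         pass  # write into our system; stubbed for the sketch
|
|     print(ingest({"event_id": "e1"}))  # rejected; publisher must fix it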
| jedberg wrote:
| The most important thing that was missed:
|
| Good: Alerts on business level metrics
|
| Bad: Alerts on machine level metrics
|
| Knowing that checkout volume is sharply down is far more valuable
| than knowing that CPU on one of the checkout servers is way up.
| Mainly because that high CPU may have no customer effect, so it's
| really not all that urgent.
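|
| A rough version of that business-level check (plain Python; the
| numbers are placeholders): compare recent checkout volume against
| the same window a week ago and page only on a large drop.
|
|     def checkout_volume_alert(current, week_ago,
|                               max_drop=0.5, min_volume=100):
|         """Alert when checkouts fall below half of last week's level.
|
|         `current` and `week_ago` are checkout counts for the same
|         window (e.g. the last 15 minutes now vs. a week earlier).
|         """
|         if week_ago < min_volume:
|             return False  # too little traffic to compare
|         return current < (1 - max_drop) * week_ago
|
|     print(checkout_volume_alert(current=120, week_ago=1000))  # True
|     print(checkout_volume_alert(current=900, week_ago=1000))  # False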
| Phelinofist wrote:
| Why not both? Scraping machine-level metrics works out of the
| box with most agents.
| jedberg wrote:
| It's fine to scrape the metrics, but what I'm saying is don't
| alert on them by default until you are sure that a particular
| server metric is actually a good alert.
| cassianoleal wrote:
| > Good: Alerts on business level metrics
|
| I agree partially. I would make it more general though: alert
| on symptoms of problems. Those can be business metrics, like
| the one you suggested in your example, or they can be system
| level, like rate of errors, or a queue that's growing out of
| control.
|
| > Bad: Alerts on machine level metrics
|
| 100% agree. There is no excuse for that. A CPU working overtime
| with no customer impact (no symptoms of problems) is an
| efficient system. I'm paying for that CPU, I'd like to use it.
| If I get an alert every time I use something I pay for, that
| will only drive me to pay more so it shuts up - even though
| there was no problem to begin with.
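|
| A symptom-style check on a queue, sketched in plain Python (the
| thresholds are illustrative): alert when depth keeps growing
| across consecutive samples, not when a CPU is merely busy.
|
|     def queue_growing_out_of_control(depth_samples, min_depth=1000):
|         """True if depth rose in every recent sample and is already large.
|
|         `depth_samples` are queue depths taken at a fixed interval,
|         oldest first (e.g. one sample per minute for 10 minutes).
|         """
|         if len(depth_samples) < 5 or depth_samples[-1] < min_depth:
|             return False
|         return all(b > a for a, b in zip(depth_samples, depth_samples[1:]))
|
|     print(queue_growing_out_of_control([200, 800, 2500, 7000, 19000]))  # True
|     print(queue_growing_out_of_control([1200, 900, 1100, 950, 1000]))   # False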
| lxe wrote:
| Yes. And also make business metrics somehow traceable to
| machine metrics.
| painchoc wrote:
| During an RCA, you find a specific error message associated with
| that incident. You deliver a new alert with some documentation
| about what it catches and what to do. You even automatically
| generate a ticket when it fires. Time passes. There's a subtle
| change in the error message. You have another production incident
| but your alert hasn't fired. The complexity comes from this: how
| do you know that an alert is still valid without creating an
| incident on purpose?
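|
| One cheap guard against that drift is to unit-test the alert's
| match pattern against the message the code emits today, so a
| reworded error breaks CI instead of silently un-arming the alert
| (a sketch; the pattern and message are illustrative):
|
|     import re
|
|     # The pattern the alerting rule matches on, kept in version
|     # control next to the code that logs the message.
|     ALERT_PATTERN = re.compile(r"payment gateway timeout after \d+ms")
|
|     def log_gateway_timeout(elapsed_ms):
|         # The message the application actually emits.
|         return f"payment gateway timeout after {elapsed_ms}ms"
|
|     def test_alert_still_matches_log_message():
|         # If someone rewords the log line, this fails loudly instead
|         # of the alert silently never firing again.
|         assert ALERT_PATTERN.search(log_gateway_timeout(3021))
|
|     test_alert_still_matches_log_message()
|     print("alert pattern still matches the emitted message")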
| jedberg wrote:
| One of the most useful things we did was add a button to every
| alert we sent that said "Was this alert useful: Yes/No".
|
| We would then send the alert creator reports on what percent of
| recipients said yes. That alone got people to realize that a lot
| of their alerts were unnecessary and get rid of them. As a bonus,
| the most useful alerts actually got subscriptions from other
| people on other teams because it was such a useful indicator.
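|
| The reporting side of that is only a few lines (a sketch; the
| data layout is illustrative): aggregate the Yes/No clicks per
| alert and send each owner the percentage.
|
|     from collections import defaultdict
|
|     def usefulness_report(feedback):
|         """`feedback` is a list of (alert_name, was_useful) tuples
|         collected from the Yes/No buttons; returns percent useful."""
|         counts = defaultdict(lambda: [0, 0])  # alert -> [useful, total]
|         for alert, useful in feedback:
|             counts[alert][0] += int(useful)
|             counts[alert][1] += 1
|         return {alert: round(100 * u / t)
|                 for alert, (u, t) in counts.items()}
|
|     clicks = ([("disk-80-percent", False)] * 18
|               + [("disk-80-percent", True)] * 2
|               + [("checkout-volume-drop", True)] * 9
|               + [("checkout-volume-drop", False)])
|     print(usefulness_report(clicks))
|     # {'disk-80-percent': 10, 'checkout-volume-drop': 90}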
| cle wrote:
| > Bad: HTTP 400
|
| > On the other hand, HTTP 400 level errors mean the client
| screwed up.
|
| This is bad general advice. HTTP 4xx errors mean the client
| screwed up, OR you screwed up (a change that e.g. increases 404
| rate due to eventual consistency, returns 404 for all content,
| breaks auth, returns the wrong status code, etc.). Either way the
| content is inaccessible to the client: the person visiting your
| website doesn't care whether they get a 404 or a 502, they care
| that the content is inaccessible. Once you get high enough
| traffic,
| monitoring 4xx rate is pretty critical to making sure people can
| actually use your service. (Or monitor the inverse, i.e. a floor
| on 2xx rate instead of a ceiling on {4,5}xx rate.)
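|
| Monitoring the inverse looks roughly like this (plain Python; the
| floor is illustrative): track the share of requests that succeed
| and alert when it dips below a floor, whatever mix of 4xx/5xx is
| eating the rest.
|
|     def success_rate_too_low(status_counts, floor=0.97, min_requests=500):
|         """`status_counts` maps status code -> count for the window,
|         e.g. {200: 940, 404: 45, 502: 15}. Alerts on a low 2xx share."""
|         total = sum(status_counts.values())
|         if total < min_requests:
|             return False
|         ok = sum(c for code, c in status_counts.items()
|                  if 200 <= code < 300)
|         return ok / total < floor
|
|     print(success_rate_too_low({200: 940, 404: 45, 502: 15}))  # True
|     print(success_rate_too_low({200: 990, 404: 8, 500: 2}))    # False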
| advisedwang wrote:
| Monitor the general 5xx error rate so you have high SNR.
|
| Cover mistakes with robust probers that should get 200 and then
| alert on any non-200 response.
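|
| A prober in its simplest form (a standard-library sketch; the URL
| is a placeholder): request a known-good page, expect a 200, and
| alert on anything else.
|
|     import urllib.error
|     import urllib.request
|
|     PROBE_URL = "https://example.com/healthz"  # placeholder endpoint
|
|     def probe():
|         """Return (ok, status) for one synthetic request."""
|         try:
|             with urllib.request.urlopen(PROBE_URL, timeout=5) as resp:
|                 return resp.status == 200, resp.status
|         except urllib.error.HTTPError as err:
|             return False, err.code
|         except urllib.error.URLError:
|             return False, None  # DNS/connect/timeout failure
|
|     ok, status = probe()
|     if not ok:
|         print(f"ALERT: prober got {status!r} instead of 200 from {PROBE_URL}")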
| breischl wrote:
| In the context of alerting, I agree with TFA. You should not be
| alerting on bad request errors, because you might have no
| control over it. That said, you might want monitoring on it so
| you can check whether the rate jumped at some important point
| (e.g., after a deployment), but I wouldn't look at it on a
| regular basis.
|
| I had something like that on an internal system. The 400 rate
| would jump all over the place because our edge systems had
| shitty input validation, and bots would crawl us with broken
| requests ("can I reserve this item starting last week?" kind of
| thing) with no rate throttling. After a few years the edge
| validation (and bot detection) got better, but alerting on that
| would've been worse than useless.
| cle wrote:
| Yeah I agree that false positives are a risk with monitoring
| 4xx rate. I've never seen a satisfactory "bulletproof" way to
| monitor 4xx rate; it's inherently difficult and simultaneously
| important to monitor.
|
| It's easy in retrospect to say "oh that was a waste of time
| because it was just bots" but you don't know that until you
| investigate. I ask myself "if I see elevated 4xx's, at what
| point do I start to care if they're caused by a bug?" and set
| monitor thresholds somewhere around there.
| k__ wrote:
| This.
|
| In my experience, 4xx and 5xx are only valuable for finding the
| right place to look, but in no way indicate whether the client or
| the server failed.
___________________________________________________________________
(page generated 2021-07-13 23:01 UTC)