[HN Gopher] Good and Bad Monitoring
       ___________________________________________________________________
        
       Good and Bad Monitoring
        
       Author : kiyanwang
       Score  : 69 points
       Date   : 2021-07-11 11:14 UTC (2 days ago)
        
 (HTM) web link (raynorelyp.medium.com)
 (TXT) w3m dump (raynorelyp.medium.com)
        
       | devchix wrote:
       | If it's like pulling teeth to motivate your users to use
       | dashboards, you're building useless dashboards.
       | 
        | Dashboards are great for an at-a-glance metrics roll-up. Build
        | small, single-page, targeted dashboards that answer the
        | questions your users actually ask, and they'll get used. I
        | want to see the lay of the land and know when I'm heading into
        | the weeds, instead of being kicked in the shin by a monitoring
        | alert when I'm already in the weeds.
        
         | retzkek wrote:
         | To elaborate on this, dashboards are where you go when you get
         | an alert, to answer questions like:
         | 
         | - how widespread is it?
         | 
         | - who's impacted?
         | 
         | - is the alert the root cause, or just a symptom?
         | 
         | Dashboards should tell a story, not just be a bunch of graphs
         | squeezed onto a page. There should be links to drill-down to
         | more detailed dashboards, logs, and traces, to make it as fast
         | and easy as possible to find the fire when you smell smoke,
         | even for someone who's on their first week.
         | 
         | Most dashboards, unfortunately, _are_ useless. But then those
         | have a place too: hanging on a wall somewhere, to show people
         | how not-useless we are.
        
       | Zealotux wrote:
       | >HTTP 400 level errors on their own do not indicate a problem.
       | 
        | ...on the back-end, but they can help find issues with the
        | front-end. For example, I have an endpoint my front-end calls
        | in the background to update a preview for the user. Some
        | front-end changes made all of those calls invalid (but only in
        | production), and monitoring 400 errors helped me figure out
        | the issue very quickly, before a customer even complained
        | about it. Sometimes "the client screwed up" because of you: if
        | 90% of your clients are messing up in exactly the same way, it
        | may actually be your fault.
        
       | sokoloff wrote:
       | > Eventually, you will add service F and no one will remember to
       | go into the monitoring service and add it, but they will see the
       | tags on the other lambdas and tag the new one correctly.
       | 
        | This does not match my intuition nor my experience, unless
        | there's automation to check or enforce it.
        
       | theginger wrote:
        | Inflated error rates due to invalid 500 errors are definitely
        | a thing; I've seen it at my last two jobs. I think it comes
        | from a lack of confidence among developers when it comes to
        | setting the HTTP status code to something different. It starts
        | with a genuine 500 caused by an unexpected invalid request the
        | app can't handle: it crashes or otherwise misbehaves. So they
        | put in some exception handling, and now they have code that
        | handles the case that was hurting the server, but they still
        | have to return something for a client request that is still
        | not valid. A 400 would almost always be appropriate, or
        | perhaps a more specific 4xx, but 500 is what was already being
        | returned, and anything else is a change that might impact the
        | client in an unexpected way. It takes a lot of confidence to
        | make a potentially breaking change that goes beyond merely
        | fixing a bug, even when you're reasonably certain it's the
        | right change. And once returning a 500 after catching an
        | exception shows up in your code base a few times, it sets a
        | precedent that others will follow.
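        | 
        | Concretely, the pattern looks something like this (a rough
        | Flask-style sketch, not anyone's actual code; the route and
        | field names are made up). The whole fix is the status code on
        | the return line:
        | 
        |   # Sketch of the pattern described above (illustrative only)
        |   from flask import Flask, jsonify, request
        | 
        |   app = Flask(__name__)
        | 
        |   @app.route("/orders", methods=["POST"])
        |   def create_order():
        |       payload = request.get_json(silent=True)
        |       if payload is None or "item_id" not in payload:
        |           # Originally this request crashed the handler and
        |           # surfaced as a 500. The check above fixes the
        |           # server-side bug, but returning 500 here would
        |           # keep inflating the server error rate even though
        |           # the request itself is the client's fault.
        |           return jsonify(error="item_id is required"), 400
        |       return jsonify(status="created"), 201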
        
       | lxe wrote:
       | HTTP status codes are like... suggestions at this point.
        
       | indigodaddy wrote:
       | Yes, much prefer a workflow where alerts that actually need to be
       | looked at just come into an alert-specific Slack channel with
       | some pretty decent basic info. We did it this way at my last job
        | with Datadog/Slack hooks. It was easy to set up and worked great.
       | Staring at dashboards or even checking them every hour or
       | whatever makes little sense.
        
       | lxe wrote:
        | Elevated 413 errors could mean that a bug was introduced to
        | the client/frontend that sends oversized payloads or cookies.
       | 
       | Elevated 400, 401, or 403 errors could mean that a bug was
       | introduced in session or cookie handling middleware, client, or
       | server code.
       | 
        | Elevated 200s could mean a DDoS attack or issues with client-
        | side polling.
       | 
       | Etc..
       | 
        | Alert on status-code anomalies, not on the volume or
        | percentage of a certain status code.
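        | 
        | A rough sketch of what "anomaly" could mean here (the baseline
        | logic and numbers are made up, not any particular vendor's
        | feature):
        | 
        |   # Flag a status-code count that deviates from its own
        |   # recent baseline, instead of a fixed volume threshold.
        |   from statistics import mean, stdev
        | 
        |   def is_anomalous(history, current, k=3.0):
        |       """history: recent per-minute counts for one code."""
        |       if len(history) < 10:
        |           return False  # not enough data for a baseline
        |       mu, sigma = mean(history), stdev(history)
        |       return current > mu + k * max(sigma, 1.0)
        | 
        |   # e.g. is_anomalous(counts_413_last_hour, count_this_minute)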
        
       | Crash0v3rid3 wrote:
       | > I started with a system where another team would publish data
       | to us in an eventing architecture and would frequently publish
       | corrupt data. It was my team's responsibility to address anytime
       | data was not ingested correctly into our system. As a result, we
       | had floods of errors in our system. We tried asking them to stop
       | and they said no.
       | 
        | In this particular instance, I would simply respond to the
        | caller with an appropriate error code and be on my way. The
        | other team should be responsible for dealing with such an
        | issue. The writer implies they had no choice, but I don't buy
        | it: if you design the system so that it doesn't accept corrupt
        | data to begin with, it becomes the caller's responsibility to
        | handle these issues.
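        | 
        | Something along these lines: validate at the boundary and
        | refuse to ingest, so a bad publish stays the publisher's
        | problem (a sketch only; field names are invented):
        | 
        |   # Reject corrupt events at ingestion instead of absorbing
        |   # them as errors on our side.
        |   REQUIRED_FIELDS = ("event_id", "timestamp", "payload")
        | 
        |   class RejectedEvent(Exception):
        |       """Surfaced to the publisher as a client error."""
        | 
        |   def ingest(event: dict) -> None:
        |       missing = [f for f in REQUIRED_FIELDS if f not in event]
        |       if missing:
        |           raise RejectedEvent(f"missing fields: {missing}")
        |       # ...normal ingestion continues here...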
        
       | jedberg wrote:
       | The most important thing that was missed:
       | 
       | Good: Alerts on business level metrics
       | 
       | Bad: Alerts on machine level metrics
       | 
       | Knowing that checkout volume is sharply down is far more valuable
       | than knowing that CPU on one of the checkout servers is way up.
       | Mainly because that high CPU may have no customer effect, so it's
       | really not all that urgent.
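        | 
        | For example, the checkout alert can be as simple as comparing
        | against the same window a week earlier (a sketch; thresholds
        | and names are made up):
        | 
        |   # Page when checkout volume drops sharply versus the same
        |   # time last week, regardless of any one machine's CPU.
        |   def checkout_volume_alarm(current_per_min,
        |                             last_week_per_min,
        |                             max_drop=0.5):
        |       if last_week_per_min == 0:
        |           return False
        |       drop = 1 - (current_per_min / last_week_per_min)
        |       return drop > max_drop  # e.g. down more than 50%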
        
         | Phelinofist wrote:
         | Why not both? Scraping machine level metrics works out of the
          | box with most agents.
        
           | jedberg wrote:
           | It's fine to scrape the metrics, but what I'm saying is don't
           | alert on them by default until you are sure that a particular
           | server metric is actually a good alert.
        
         | cassianoleal wrote:
         | > Good: Alerts on business level metrics
         | 
         | I agree partially. I would make it more general though: alert
         | on symptoms of problems. Those can be business metrics, like
         | the one you suggested in your example, or they can be system
         | level, like rate of errors, or a queue that's growing out of
         | control.
         | 
         | > Bad: Alerts on machine level metrics
         | 
         | 100% agree. There is no excuse for that. A CPU working overtime
         | with no customer impact (no symptoms of problems) is an
         | efficient system. I'm paying for that CPU, I'd like to use it.
          | If I get an alert every time I use something I'm paying for,
          | that will only drive me to pay for more so the alert shuts
          | up, even though there was no problem to begin with.
        
         | lxe wrote:
         | Yes. And also make business metrics somehow traceable to
         | machine metrics.
        
       | painchoc wrote:
        | During an RCA, you find a specific error message associated
        | with that incident. You deliver a new alert, with some
        | documentation about what it catches and what to do, and you
        | even automatically generate a ticket when it fires. Time
        | passes. There's a subtle change in the error message. You have
        | another production incident, but your alert never fires. The
        | complexity comes from this: how do you know that an alert is
        | still valid without creating an incident on purpose?
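        | 
        | The alert in question is typically just a string or regex
        | match on the log line, so a reworded message silently defeats
        | it. For illustration (the message text is made up):
        | 
        |   import re
        | 
        |   # The alert written during the RCA, keyed to exact wording:
        |   ALERT = re.compile(r"payment gateway timed out")
        | 
        |   old_log = "ERROR payment gateway timed out after 30s"
        |   new_log = "ERROR payment gateway timeout after 30s"
        | 
        |   assert ALERT.search(old_log)      # fired back then
        |   assert not ALERT.search(new_log)  # silently missed now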
        
       | jedberg wrote:
       | One of the most useful things we did was add a button to every
       | alert we sent that said "Was this alert useful: Yes/No".
       | 
        | We would then send the alert creator a report on what percent
        | of recipients said yes. That alone got people to realize that
        | many of their alerts were unnecessary and to get rid of them.
        | As a bonus,
       | the most useful alerts actually got subscriptions from other
       | people on other teams because it was such a useful indicator.
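        | 
        | The report side can be a trivial roll-up, roughly like this
        | (data shape invented for illustration):
        | 
        |   # Percent of recipients who marked each alert useful.
        |   from collections import defaultdict
        | 
        |   def usefulness_report(votes):
        |       """votes: iterable of (alert_name, was_useful)."""
        |       yes = defaultdict(int)
        |       total = defaultdict(int)
        |       for alert, useful in votes:
        |           total[alert] += 1
        |           yes[alert] += 1 if useful else 0
        |       return {a: 100.0 * yes[a] / total[a] for a in total}
        | 
        |   # e.g. {"disk-80-percent": 4.0, "checkout-drop": 96.0}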
        
       | cle wrote:
        | > Bad: HTTP 400
        | 
        | > On the other hand, HTTP 400 level errors mean the client
        | screwed up.
       | 
        | This is bad general advice. HTTP 4xx errors mean the client
        | screwed up, OR you screwed up (a change that, e.g., increases
        | the 404 rate due to eventual consistency, returns 404 for all
        | content, breaks auth, returns the wrong status code, etc.).
        | Either way the content is inaccessible to the client: the
        | person visiting your website doesn't care whether they get a
        | 404 or a 502, they care that the content is inaccessible. Once
        | you get high enough traffic, monitoring the 4xx rate is pretty
        | critical to making sure people can actually use your service.
        | (Or monitor the inverse, i.e. a floor on the 2xx rate instead
        | of a ceiling on the {4,5}xx rate.)
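        | 
        | A rough sketch of the "2xx floor" variant (thresholds are
        | illustrative):
        | 
        |   # Alert when the share of successful responses drops below
        |   # a floor, instead of a ceiling on 4xx/5xx.
        |   def below_success_floor(status_counts, floor=0.98,
        |                           min_requests=500):
        |       """status_counts: e.g. {200: 4900, 404: 60, 502: 40}"""
        |       total = sum(status_counts.values())
        |       if total < min_requests:
        |           return False  # too little traffic to judge
        |       ok = sum(n for code, n in status_counts.items()
        |                if 200 <= code < 300)
        |       return ok / total < floor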
        
         | advisedwang wrote:
         | Monitor the general 5xx error rate so you have high SNR.
         | 
         | Cover mistakes with robust probers that should get 200 and then
         | alert on any non-200 response.
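          | 
          | i.e. something like this running on a schedule against a
          | known-good request (a sketch; the URL is a placeholder):
          | 
          |   # A prober: a request that should always return 200;
          |   # anything else is alert-worthy.
          |   import requests
          | 
          |   def probe(url="https://example.com/healthcheck"):
          |       try:
          |           resp = requests.get(url, timeout=5)
          |           return resp.status_code == 200
          |       except requests.RequestException:
          |           return False  # network failure = failed probe
          | 
          |   # Run from outside the serving path every minute and
          |   # alert when it fails a few times in a row.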
        
         | breischl wrote:
          | In the context of alerting, I agree with TFA. You should not
          | be alerting on bad-request errors, because you might have no
          | control over them. That said, you might still want
          | monitoring on them so you can check whether the rate jumped
          | at some important point (e.g., after a deployment), but I
          | wouldn't look at it on a regular basis.
         | 
         | I had something like that on an internal system. The 400 rate
         | would jump all over the place because our edge systems had
         | shitty input validation, and bots would crawl us with broken
         | requests ("can I reserve this item starting last week?" kind of
         | thing) with no rate throttling. After a few years the edge
         | validation (and bot detection) got better, but alerting on that
         | would've been worse than useless.
        
           | cle wrote:
            | Yeah, I agree that false positives are a risk with
            | monitoring the 4xx rate. I've never seen a satisfactory
            | "bulletproof" way to monitor it: it's inherently
            | difficult and simultaneously important to monitor.
           | 
           | It's easy in retrospect to say "oh that was a waste of time
           | because it was just bots" but you don't know that until you
           | investigate. I ask myself "if I see elevated 4xx's, at what
           | point do I start to care if they're caused by a bug?" and set
           | monitor thresholds somewhere around there.
        
         | k__ wrote:
         | This.
         | 
          | In my experience, 4xx and 5xx are only valuable for finding
          | the right place to look, but in no way indicate whether the
          | client or the server failed.
        
       ___________________________________________________________________
       (page generated 2021-07-13 23:01 UTC)