[HN Gopher] Show HN: Root Cause as a Service - Never dig through...
       ___________________________________________________________________
        
       Show HN: Root Cause as a Service - Never dig through logs again
        
       Hey folks - Larry, Ajay and Rod here! We address the age-old,
       painful problem of digging through logs to find the root cause
       when a problem occurs. No one likes searching through logs, so
       we spent a few years analyzing hundreds of real-world incidents
       to understand how humans troubleshoot in logs. Then we built a
       solution that automatically finds the same root cause
       indicators a human would have had to search for manually. We
       call it Root Cause as a Service. RCaaS works with any app and
       requires no manual training or rules. Our foundational thoughts
       and more details are here:
       https://www.zebrium.com/blog/its-time-to-automate-the-observer

       Obviously, everyone is skeptical when they hear about RCaaS. We
       encourage you to try it yourself, but we also have a really
       strong validation point. One of our customers ran a study using
       192 actual customer incidents from 4 different products and
       found that Zebrium correctly identified the root cause
       indicators in the logs in over 95% of the incidents - see
       https://www.zebrium.com/cisco-validation.

       For those who are interested, this is actually our second Show
       HN post; our first was last June -
       https://news.ycombinator.com/item?id=23490609. The link in that
       post points to our current home page, but our initial comment
       was, "We're excited to share Zebrium's autonomous incident
       detection software". At the time, our focus was on a tool that
       used unsupervised ML to automatically detect any kind of new or
       unknown software incident. We had done a lot of customer
       testing and were achieving > 90% detection accuracy in catching
       almost any kind of problem. But what we underestimated is just
       how high the bar is for incident detection. If someone is going
       to hook you up to a pager, then even an occasional false
       positive is enough for a user to start cursing your product!
       And users quickly forget about the times your product saved
       their bacon by catching problems they would otherwise have
       missed.

       But late last year we had a huge aha moment! Most customers
       already have monitoring tools in place that are really good at
       detecting problems; what they don't have is an automated way to
       find the root cause. So we built integrations for Datadog, New
       Relic, Elastic, Grafana, Dynatrace, AppDynamics and
       ScienceLogic (and more to come via our open APIs) so that when
       there's a problem, you see details of the root cause directly
       on your monitoring dashboard. Here's a 2-minute demo of what it
       looks like: https://youtu.be/t83Egs5l8ok.

       You're welcome to sign up for a free trial at
       https://www.zebrium.com, and we'd love to hear your questions
       and feedback.
        
       Author : stochastimus
       Score  : 17 points
       Date   : 2022-06-17 15:55 UTC (7 hours ago)
        
       | randombits0 wrote:
       | If rnd() > .5 printf("it's DNS!")
        
         | stochastimus wrote:
         | LoLz
        
       | treis wrote:
       | > Here's a 2 minute demo of what it looks like:
       | https://youtu.be/t83Egs5l8ok.
       | 
       | The problem with this demo is that it uses something that's 100%
       | broken due to something that happened immediately before the
       | failure. That's not hard to debug and I don't really see value
       | there.
       | 
       | The scenarios that could use this sort of tool are things like
       | someone turning on a flag that breaks 1% of a specific endpoint
       | but only 0.1% of overall requests. So something sub-alert level,
       | without an immediately obvious cause and effect. If you can
       | detect something like that without generating a ton of noise
       | and give a hint at the root cause, then that'd be something
       | killer.
       | 
       | It's a cool idea and I can see the value. We've had scenarios
       | like the one I mentioned (and worse) go undetected because of the
       | noise Sentry generates. If you can solve that then you've really
       | got something.
        
         | stochastimus wrote:
         | Hey! Founder here. Thanks for your comment!
         | 
         | > That's not hard to debug
         | 
         | I agree there's a large class of problems where that's the
         | case, and I also agree it's perhaps easier than many of the
         | issues we've caught in the real world, such as are discussed in
         | the Cisco case study. But I still want to defend the example as
         | non-trivial.
         | 
         | So of course we pick up the errors in services that are
         | fallout from the breakage, and we pick up the chaos test
         | running, which leaves a pretty wide footprint. But if
         | something like this happened in the real world, the important
         | event would have been the one that's in the RC report, which
         | isn't even an error (though it is quite unusual): the kernel
         | log entry pointing out the eth0 misconfig. It can take a long
         | time for someone to get around to poring through their
         | various host kernel logs and looking at everything, even
         | non-errors, so having it surfaced to you right away feels
         | like a very useful thing.
         | 
         | At the same time though, I like the example you gave even
         | more. What many people will do is upload their own incident
         | data via the CLI, or deploy our chart into a staging
         | environment and just break things, to see what happens. Based
         | on your description, there's a good chance we would pick it
         | up, since we track anomalies and perform anomaly correlation
         | across all pairs of streams (a minimal sketch of that idea
         | follows this comment): we're not thresholding on an overall
         | error rate anywhere. If you'd be willing to help us test this
         | use case, I'd love to work with you personally to help you
         | get up and running.
         | 
         | Also, as we see more and more failure modes, we continue to
         | make our detection algorithms more robust. While we generally
         | achieve >>90% detection of incidents and their root cause
         | indicators overall, there are always places we can and would
         | love to do better. I think you'd find that our solution would
         | catch most of what you'd want out-of-the-box, and that we're
         | responsive enough to learn from every customer.
         | 
         | Lastly, I would ask: do you think that we should do videos of
         | some more "subtle" examples (for lack of a better word)? As
         | someone viewing the website, would you have watched them?
         | 
         | Thanks again for your feedback!
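
       For illustration only, here is a minimal sketch of the pairwise
       anomaly correlation idea described in the comment above: bucket
       anomalous events per stream into time windows, then flag stream
       pairs whose anomaly counts rise and fall together. The event
       format, bucket size and threshold are assumptions, not Zebrium's
       actual implementation.

         # Hypothetical sketch: correlate anomaly counts across pairs of
         # log streams instead of thresholding an overall error rate.
         from collections import defaultdict
         from itertools import combinations
         from statistics import mean, pstdev

         BUCKET = 60  # seconds per time bucket (assumed)

         def bucket_anomalies(events):
             """events: iterable of (timestamp, stream, is_anomalous)."""
             counts = defaultdict(lambda: defaultdict(int))
             for ts, stream, is_anomalous in events:
                 if is_anomalous:
                     counts[stream][int(ts // BUCKET)] += 1
             return counts  # stream -> {bucket -> anomaly count}

         def correlation(xs, ys):
             """Pearson correlation of two equal-length count series."""
             mx, my = mean(xs), mean(ys)
             sx, sy = pstdev(xs), pstdev(ys)
             if sx == 0 or sy == 0:
                 return 0.0
             return sum((x - mx) * (y - my)
                        for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

         def correlated_stream_pairs(events, threshold=0.8):
             counts = bucket_anomalies(events)
             buckets = sorted({b for s in counts.values() for b in s})
             flagged = []
             for a, b in combinations(counts, 2):
                 xs = [counts[a].get(t, 0) for t in buckets]
                 ys = [counts[b].get(t, 0) for t in buckets]
                 if correlation(xs, ys) >= threshold:
                     flagged.append((a, b))
             return flagged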
        
         | netfortius wrote:
         | >"... I don't really see value there ... I can see the value."
         | 
         | Huh!?!
        
           | treis wrote:
           | They're referring to different things....
        
       | throwaway81523 wrote:
       | 95.8% of the time it's kind of obvious what happened, at least
       | with reasonable monitoring. Digging through logs is for the other
       | 4.2% of the time. Having done that kind of thing more than once,
       | I don't see ML as being helpful. You often end up writing scripts
       | to search for specific combinations of events that are only
       | identifiable after the incident has happened (a sketch of that
       | kind of script follows below).
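
       As an illustration of that kind of after-the-fact script, here is
       a minimal sketch that scans a log file for a specific combination
       of events occurring within a short time window. The event
       patterns, timestamp format and window size are made-up examples.

         # Hypothetical script: report when a specific combination of
         # events all appear within WINDOW of each other.
         import re
         import sys
         from datetime import datetime, timedelta

         PATTERNS = [re.compile(p) for p in (
             r"connection pool exhausted",   # assumed example events
             r"failover initiated",
         )]
         WINDOW = timedelta(minutes=5)
         TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2})")

         def parse_ts(line):
             m = TS_RE.match(line)
             if not m:
                 return None
             return datetime.fromisoformat(m.group(1).replace(" ", "T"))

         hits = []  # (timestamp, pattern index, line)
         with open(sys.argv[1], errors="replace") as log:
             for line in log:
                 ts = parse_ts(line)
                 if ts is None:
                     continue
                 for i, pat in enumerate(PATTERNS):
                     if pat.search(line):
                         hits.append((ts, i, line.rstrip()))

         # Report the first window in which every pattern appears.
         for ts, _, _ in hits:
             window = [h for h in hits if ts <= h[0] <= ts + WINDOW]
             if {h[1] for h in window} == set(range(len(PATTERNS))):
                 print(f"combination seen starting at {ts}:")
                 for _, _, text in window:
                     print("   ", text)
                 break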
        
       | kordlessagain wrote:
       | Having worked on a machine learning time series document search
       | solution for the last 2 years, I know exactly why the cost of
       | this is so high. Running logs through a model must be VERY
       | expensive.
       | 
       | I had a good friend at Splunk who passed a few years ago. He was
       | working on something similar, well before we had decent models.
       | His anomaly detection used differences in regular expression
       | patterns to detect "strange things" (a rough sketch of that
       | kind of approach appears further below). I guess that's why he
       | carried the title "Chief Mind".
       | 
       | I'm excited about where ML and time series data are going. It's
       | going to be interesting!
        
         | stochastimus wrote:
         | It's really quite inexpensive - but we would love to get your
         | feedback on pricing, especially at scale! Please do reach out.
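
       For illustration, here is a minimal sketch of the pattern-based
       approach kordlessagain describes above: reduce each log line to
       a coarse template by masking variable fields, then flag lines
       whose template never appears in a known-good baseline. The
       masking rules are assumptions, not the actual Splunk-era work.

         # Hypothetical sketch: flag log lines whose "shape" (template)
         # differs from everything seen in a known-good baseline.
         import re

         def shape(line):
             """Mask variable parts so lines from the same print
             statement map to the same template."""
             s = re.sub(r'"[^"]*"', '"*"', line.strip())
             s = re.sub(r"\b0x[0-9a-fA-F]+\b", "<hex>", s)
             s = re.sub(r"\b[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}\b",
                        "<uuid>", s)
             s = re.sub(r"\b\d+(\.\d+)?\b", "<num>", s)
             return s

         def unseen_shapes(baseline_lines, new_lines):
             """Return lines from new_lines whose template never
             appears in the baseline."""
             known = {shape(l) for l in baseline_lines}
             return [l for l in new_lines if shape(l) not in known]

         # Example usage: compare today's log against a known-good day.
         # with open("baseline.log") as b, open("today.log") as t:
         #     for line in unseen_shapes(list(b), t):
         #         print("anomalous:", line.rstrip())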
        
       | Huntsecker wrote:
       | Do you not take volume of logs into consideration for pricing,
       | then?
        
         | stochastimus wrote:
         | We do indeed, since we need to process an amount of data in
         | proportion to that volume, and so more resources are required.
         | You can also run it on-prem if you want.
        
       ___________________________________________________________________
       (page generated 2022-06-17 23:02 UTC)