Real World Examples of GPT-3 Plain Language Root Cause Summaries

Ajay Singh

[Image: plain language root cause process]

A few weeks ago, Larry, our CTO, wrote about a new beta feature leveraging the GPT-3 language model: Using GPT-3 for plain language incident root cause from logs. To recap: Zebrium's unsupervised ML identifies the root cause of incidents and generates concise reports (typically between 5 and 20 log events) identifying the first event in the sequence (typically the root cause), the worst symptom, other associated events, and correlated metrics anomalies.

[Image: root cause report summary]

As Larry pointed out, this works well for developers who are familiar with the logs, but it can be hard to digest if an SRE or frontline ops engineer isn't familiar with the application internals. The GPT-3 integration lets us take the next step: distilling these root cause reports into concise natural language summaries by drawing on GPT-3's knowledge of similar incidents described across the public internet, and extracting brief "English" descriptions for a user to scan.

After a few weeks of beta testing this feature with a limited group, and after examining results from a couple of hundred incidents, we're ready to share some exciting results and expand access to ALL Zebrium users, even those on free trials. In a nutshell: it works so well, and in such a wide range of scenarios, that we felt most users would benefit from having access to it. These summaries are both accurate and genuinely useful, distilling log events into a description that a frontline or experienced engineer can easily understand.

First, the caveats

This is still an early-stage feature for us, and there are cases where GPT-3 veers into guesswork and suggests summaries that seem related to the core RCA report but aren't exactly right. To make sure users know this, we tag the summaries with an "EXPERIMENTAL" badge in the UI.

There are also times when a specific RCA report does not generate a particularly illuminating natural language summary beyond recapping the key log event(s). For instance:

* The first log message indicates that the node was not responding.
* The first log message is a fatal error indicating that the program detected a problem and aborted.
* The client sent a request to the server, but the server did not respond.
* The first log message shows that the **** task took *** ms to complete.

There are several possible reasons for these suboptimal outcomes. One possibility is that there simply aren't enough examples of that type of issue in the public domain, so GPT-3 responds with the closest details it can find. Another is that we haven't yet explored all the variants of prompts and options we can use with the GPT-3 model. The good news is that even when results are suboptimal, they are mostly not misleading and are easily ignored. More importantly, our ML-generated root cause summaries are the perfect input source for GPT-3, and with more work the outcomes will only get better from here.
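To make the prompting idea above concrete, here is a minimal sketch of how a handful of log events from a root cause report could be sent to GPT-3 through the 2021-era OpenAI Completions API to obtain a one-sentence summary. This is an illustration under stated assumptions, not Zebrium's actual prompt, engine choice, or pipeline; the log lines, prompt wording, and parameters are hypothetical.

```python
# Minimal sketch (not Zebrium's implementation): ask GPT-3 for a plain
# language summary of a few log events from a root cause report.
# Assumes the `openai` Python package (2021-era Completions API) and an
# OPENAI_API_KEY environment variable; the log lines are hypothetical.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

log_events = [
    "kernel: Out of memory: Kill process 2314 (memcached) score 901",
    "kernel: Killed process 2314 (memcached) total-vm:1482140kB",
    "systemd: memcached.service: main process exited, code=killed, status=9/KILL",
]

prompt = (
    "Explain the root cause of the following log events in one plain "
    "English sentence.\n\nLogs:\n"
    + "\n".join(log_events)
    + "\n\nRoot cause:"
)

response = openai.Completion.create(
    engine="davinci",   # base GPT-3 model exposed by the 2021-era API
    prompt=prompt,
    max_tokens=60,
    temperature=0.2,    # low temperature keeps the summary conservative
    stop=["\n"],
)

print(response.choices[0].text.strip())
```

In practice, much of the tuning effort lies in varying the prompt wording and sampling options, which is exactly the space of "variants of prompts and options" mentioned above.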
Now for the good part

The great news is that it works well more often than not, and the results are genuinely useful. Here are some examples where the GPT-3 summary described the event collection accurately and helped the user quickly digest the RCA. Note: we have obfuscated details that might be sensitive, and we're not sharing the raw log events for the same reason, although they would be useful to compare alongside the summaries.

As a first bucket, here are some interesting and useful incident summaries related to memory starvation:

* The **** daemon was unable to allocate memory for the **** data.
* The RabbitMQ broker was running out of memory.
* The **** process was killed by the OOM killer.
* The application was trying to fork a child process, but the fork call failed because the system was out of memory.
* The memcached process was killed by the OOM killer because it was consuming too much memory. The OOM killer is a Linux kernel feature that allows the system to kill a process when the system is running out of memory.
* The memory cgroup was out of memory, so the kernel killed process **** and its child ****.

Next, here are some other infrastructure-related incidents:

* The AWS Machine Image (AMI) that the customer used was for an x86_64 architecture, but the customer's instance type was arm64.
* The file system was corrupted.
* The Kafka producer was not configured to be a leader for the topic-partition.
* The root cause of the issue was that the Jenkins master was not able to connect to the vCenter server.
* The root cause was a "drive failed" error.
* The cluster was under heavy load, and the scheduler was unable to schedule the pod.
* The pod was scheduled on a node that had insufficient CPU.
* The root cause was that the Slack API was rate limited.

For variety, here are some database-related incidents:

* The database was closed, and the query failed.
* The EC2 instance was running out of connections to the database.
* The database driver was unable to ping the database server.
* The first message is a SQL error, which means that the database server was unable to execute the query.

Finally, here are some examples of security-related incident summaries:

* The LDAP server was not responding to the client.
* The root cause of the issue was that the certificate chain did not match any of the trust anchors.
* The root cause of the problem was that the sshd daemon on the server was configured to allow only three authentication attempts per IP address.
* The server rejected the connection because it has already seen too many invalid authentication attempts for that user.

Summary

Our focus is to cut troubleshooting time by using machine learning to summarize the key event sequences that describe an incident, based on logs and associated metrics anomalies. The GPT-3 integration is a big step towards that goal: enabling quick review of RCA reports by anyone, even personnel who may not be intimately familiar with application internals. As described above, there are still improvements to be made, but it works so well in real-world scenarios that we are now opening it up to all our users. Try it for yourself by signing up for a free trial.