Real World Examples of GPT-3 Plain Language Root Cause Summaries

Ajay Singh

[Image: plain language root cause process]

A few weeks ago, Larry, our CTO, wrote about a new beta feature leveraging the GPT-3 language model: Using GPT-3 for plain language incident root cause from logs. To recap: Zebrium's unsupervised ML identifies the root cause of incidents and generates concise reports (typically between 5 and 20 log events) identifying the first event in the sequence (typically the root cause), the worst symptom, other associated events, and correlated metrics anomalies.

[Image: root cause report summary]

As Larry pointed out, this works well for developers who are familiar with the logs, but it can be hard to digest if an SRE or frontline ops engineer isn't familiar with the application internals. The GPT-3 integration lets us take the next step: distilling these root cause reports into concise natural language summaries by drawing on GPT-3's knowledge of similar incidents described across the public internet, and extracting brief "English" descriptions for a user to scan.

After a few weeks of beta testing this feature with a limited group, and after examining results from a couple of hundred incidents, we're ready to share some exciting results and expand access to ALL Zebrium users, even those on free trials. In a nutshell: it works so well, and in such a wide range of scenarios, that we felt most users would benefit from having access to it. These summaries are both accurate and genuinely useful, distilling log events into a description that a frontline or experienced engineer can easily understand.

First, the caveats

This is still an early-stage feature for us, and there are cases where GPT-3 veers into guesswork and suggests summaries that seem related to the core RCA report but aren't exactly right. To make sure users know this, we tag the summaries with an "EXPERIMENTAL" badge in the UI.

There are also times when a specific RCA report does not generate a particularly illuminating natural language summary beyond recapping the key log event(s). For instance:

* The first log message indicates that the node was not responding.
* The first log message is a fatal error indicating that the program detected a problem and aborted.
* The client sent a request to the server, but the server did not respond.
* The first log message shows that the **** task took *** ms to complete.

There are several possible reasons for these suboptimal outcomes. One possibility is that there simply aren't enough examples of that type of issue in the public domain, so GPT-3 responds with the closest details it can find. Another is that we haven't yet explored all the variants of prompts and options we can use with the GPT-3 model. The good news is that even when results are suboptimal, they are mostly not misleading and are easily ignored. More importantly, our ML-generated root cause summaries are the perfect input source for GPT-3, and with more work the outcomes will only get better from here.
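To make the prompting idea above concrete, here is a minimal sketch of how a handful of log events from a root cause report could be sent to GPT-3 through the 2021-era OpenAI Completions API to obtain a one-sentence summary. This is an illustration under stated assumptions, not Zebrium's actual prompt, engine choice, or pipeline; the log lines, prompt wording, and parameters are hypothetical.

```python
# Minimal sketch (not Zebrium's implementation): ask GPT-3 for a plain
# language summary of a few log events from a root cause report.
# Assumes the `openai` Python package (2021-era Completions API) and an
# OPENAI_API_KEY environment variable; the log lines are hypothetical.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

log_events = [
    "kernel: Out of memory: Kill process 2314 (memcached) score 901",
    "kernel: Killed process 2314 (memcached) total-vm:1482140kB",
    "systemd: memcached.service: main process exited, code=killed, status=9/KILL",
]

prompt = (
    "Explain the root cause of the following log events in one plain "
    "English sentence.\n\nLogs:\n"
    + "\n".join(log_events)
    + "\n\nRoot cause:"
)

response = openai.Completion.create(
    engine="davinci",   # base GPT-3 model exposed by the 2021-era API
    prompt=prompt,
    max_tokens=60,
    temperature=0.2,    # low temperature keeps the summary conservative
    stop=["\n"],
)

print(response.choices[0].text.strip())
```

In practice, much of the tuning effort lies in varying the prompt wording and sampling options, which is exactly the space of "variants of prompts and options" mentioned above.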
Now for the good part

The great news is that it works well more often than not, and the results are genuinely useful. Here are some examples where the GPT-3 summary described the event collection accurately and helped the user quickly digest the RCA. Note: we have obfuscated details that might be sensitive, and we're not sharing the raw log events for the same reason, although they would be useful to compare alongside the summaries.

As a first bucket, here are some interesting and useful incident summaries related to memory starvation:

* The **** daemon was unable to allocate memory for the **** data.
* The RabbitMQ broker was running out of memory.
* The **** process was killed by the OOM killer.
* The application was trying to fork a child process, but the fork call failed because the system was out of memory.
* The memcached process was killed by the OOM killer because it was consuming too much memory. The OOM killer is a Linux kernel feature that allows the system to kill a process when the system is running out of memory.
* The memory cgroup was out of memory, so the kernel killed process **** and its child ****.

Next, here are some other infrastructure-related incidents:

* The AWS Machine Image (AMI) that the customer used was for an x86_64 architecture, but the customer's instance type was arm64.
* The file system was corrupted.
* The Kafka producer was not configured to be a leader for the topic-partition.
* The root cause of the issue was that the Jenkins master was not able to connect to the vCenter server.
* The root cause was a "drive failed" error.
* The cluster was under heavy load, and the scheduler was unable to schedule the pod.
* The pod was scheduled on a node that had insufficient CPU.
* The root cause was that the Slack API was rate limited.

For variety, here are some database-related incidents:

* The database was closed, and the query failed.
* The EC2 instance was running out of connections to the database.
* The database driver was unable to ping the database server.
* The first message is a SQL error, which means that the database server was unable to execute the query.

Finally, here are some examples of security-related incident summaries:

* The LDAP server was not responding to the client.
* The root cause of the issue was that the certificate chain did not match any of the trust anchors.
* The root cause of the problem was that the sshd daemon on the server was configured to allow only three authentication attempts per IP address.
* The server rejected the connection because it has already seen too many invalid authentication attempts for that user.

Summary

Our focus is to cut troubleshooting time by using machine learning to summarize the key event sequences that describe an incident, based on logs and associated metrics anomalies. The GPT-3 integration is a big step towards that goal: enabling quick review of RCA reports by anyone, even personnel who may not be intimately familiar with application internals. As described above, there are still improvements to be made, but it works so well in real-world scenarios that we are now opening it up to all our users. Try it for yourself by signing up for a free trial.