Anomaly Detection Using Alert Groups and Bayesian Networks

Metrics, alerts or dashboards? In the Kubernetes observability market, vendors are competing fiercely for dominance with both commercial products and open source-based solutions. At the same time, companies that want to adopt Kubernetes-based services are actively looking for observability solutions, recognizing that it is difficult to develop and operate Kubernetes-based IT services without observability across microservices, PaaS and cloud infrastructure.

I want to ask a question: Why are you looking to adopt an observability solution rather than just a monitoring solution? Vendors boast about rich Kubernetes metrics, alerts built on top of those metric databases and colorful dashboards that visualize the data. If you install open source Prometheus and Grafana yourself, for example, you can obtain a good amount of operational metrics and dashboards in a relatively short time. You will be exposed to more information and gain a better understanding of your systems if you simply spend some time studying these metrics, alerts and graphs.

But if you take this approach, how much will you be able to do proactively to prevent operational failures? The DevOps engineers I've met have installed the dashboards and stored the metrics, and they respond to incidents faster through alerts and dashboards, but they still find it difficult to predict the possibility of failures and even harder to prevent outages. Meanwhile, the cost of the observability stack itself keeps growing, even though observability solutions do not seem to deliver better results than past monitoring solutions and dashboards.

Alert Groups for Your Observability Targets

To identify observability targets, I pay attention to the alerts that Prometheus generates. In fact, rather than relying on a single alert, I find that grouping alerts from several different sources makes it easier to recognize that something is wrong. Let's take MySQL, one of the most popular open source databases, as an example. What I learned while operating MySQL is that rapid increases in storage, memory and CPU usage, connection pool saturation and lock waiting time are preliminary symptoms of a system failure. These occurrences should prompt your DevOps engineers to review the system events and logs to figure out what's happening.

However, from the point of view of DevOps engineers who receive tons of alerts at one-minute intervals, it can be difficult to connect these alerts to a possible system outage. The individual alerts thrown by MySQL exporter, kubelet, Node Exporter and more need to be classified and grouped using alert labels, such as instance or namespace, so that the relationships become visible and the DevOps team can take immediate action on a specific target system or resource.
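To make this concrete, here is a minimal sketch (not the author's implementation) that pulls firing alerts from Alertmanager's v2 API and groups them by namespace and instance labels. The Alertmanager URL and the exact label names are assumptions you would adapt to your environment.

```python
# A minimal sketch, assuming an Alertmanager reachable at ALERTMANAGER_URL;
# /api/v2/alerts is Alertmanager's standard v2 API endpoint.
from collections import defaultdict

import requests

ALERTMANAGER_URL = "http://alertmanager:9093"  # assumption: adjust for your cluster


def fetch_active_alerts():
    """Fetch currently firing alerts from Alertmanager."""
    resp = requests.get(f"{ALERTMANAGER_URL}/api/v2/alerts", params={"active": "true"})
    resp.raise_for_status()
    return resp.json()


def group_alerts_by_target(alerts):
    """Group alerts by (namespace, instance) labels so related alerts are reviewed together."""
    grouped = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        key = (labels.get("namespace", "unknown"), labels.get("instance", "unknown"))
        grouped[key].append(labels.get("alertname", "unknown"))
    return grouped


if __name__ == "__main__":
    for target, alertnames in group_alerts_by_target(fetch_active_alerts()).items():
        print(target, alertnames)
```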

When only individual alerts are reviewed, an incident ticket is issued only after a critical-level alert fires. By the time you respond, you are already doing post-incident cleanup, and the newly purchased observability system ends up being used mainly to supply the information needed for a post-mortem report to management.

MySQL-related alerts can be grouped as:

  • System resources: Memory, CPU, storage
  • MySQL internal resources: Connection pool, lock waiting time, slow queries
  • Networks
  • Other indications: Aborted connection, transaction handler and more

If we have database and operations SMEs and can score the failure probability of each alert group based on the occurrence of individual alerts, we should be able to estimate system anomalies and take proactive, preventive action against a possible outage.
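As a sketch of what the input to such scoring could look like, the following maps alert names to the groups above and derives a binary state per group. The alert names are hypothetical examples in the style of common node_exporter and mysqld_exporter rules, not names taken from this article.

```python
# A sketch of an alert-to-group "data map". The alert names below are hypothetical;
# replace them with the alert names your Prometheus rules actually produce.
ALERT_GROUPS = {
    "system_resources": {"HostHighCPULoad", "HostOutOfMemory", "HostOutOfDiskSpace"},
    "mysql_internal": {"MysqlTooManyConnections", "MysqlSlowQueries", "MysqlHighLockWait"},
    "network": {"HostUnusualNetworkThroughput"},
    "other": {"MysqlAbortedConnections"},
}


def group_states(firing_alertnames):
    """Return a binary state per alert group: 1 if any alert in the group is firing."""
    return {
        group: int(bool(names & set(firing_alertnames)))
        for group, names in ALERT_GROUPS.items()
    }


# Example: two firing alerts map onto two "active" groups.
print(group_states({"HostOutOfMemory", "MysqlHighLockWait"}))
# {'system_resources': 1, 'mysql_internal': 1, 'network': 0, 'other': 0}
```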

Bayesian Networks With Alert Groups as Network Nodes

This method maps naturally onto a well-known probabilistic model: the Bayesian network. A Bayesian network is a popular way to model uncertain problems when interpretability is more useful than pure prediction from a historical data set. Based on focus group discussions with system experts, we can express the failure probability of each alert group, count alert occurrences and run Bayesian network inference to calculate the probability of a system anomaly. We can also apply machine learning to evaluate the model's performance and keep improving its quality.
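A minimal sketch of such a model with pgmpy follows. The structure and conditional probabilities are illustrative placeholders that SMEs would supply in practice, and only two of the four alert groups are modeled to keep the conditional probability table small.

```python
# A minimal pgmpy sketch. All probabilities are illustrative placeholders,
# not values recommended by this article.
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination
from pgmpy.models import BayesianNetwork  # named BayesianModel in older pgmpy releases

# Structure: each alert group node points at the SystemAnomaly node.
model = BayesianNetwork(
    [("SystemResources", "SystemAnomaly"), ("MySQLInternal", "SystemAnomaly")]
)

# Prior probability that each alert group is "active" (state 1) at any given time.
cpd_sys = TabularCPD("SystemResources", 2, [[0.9], [0.1]])
cpd_db = TabularCPD("MySQLInternal", 2, [[0.85], [0.15]])

# P(SystemAnomaly | SystemResources, MySQLInternal);
# columns enumerate parent states (0,0), (0,1), (1,0), (1,1).
cpd_anomaly = TabularCPD(
    "SystemAnomaly",
    2,
    [
        [0.99, 0.70, 0.60, 0.10],  # P(anomaly = 0 | parents)
        [0.01, 0.30, 0.40, 0.90],  # P(anomaly = 1 | parents)
    ],
    evidence=["SystemResources", "MySQLInternal"],
    evidence_card=[2, 2],
)

model.add_cpds(cpd_sys, cpd_db, cpd_anomaly)
assert model.check_model()

# Inference: both alert groups are currently firing (state 1).
inference = VariableElimination(model)
posterior = inference.query(
    ["SystemAnomaly"], evidence={"SystemResources": 1, "MySQLInternal": 1}
)
print(posterior)  # probability of a system anomaly given the firing alert groups
```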

The Bayesian network is also well suited to visualizing the evaluation model. The graph structure directly shows how the probability of an anomaly is derived.
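For example, the illustrative structure from the sketch above can be rendered with networkx and matplotlib; this is one possible approach rather than a pgmpy-specific feature.

```python
# A small sketch that draws the same illustrative structure as the pgmpy example:
# alert group nodes pointing at the anomaly node.
import matplotlib.pyplot as plt
import networkx as nx

graph = nx.DiGraph(
    [("SystemResources", "SystemAnomaly"), ("MySQLInternal", "SystemAnomaly")]
)
pos = nx.spring_layout(graph, seed=42)
nx.draw_networkx(graph, pos=pos, node_color="lightblue", node_size=3000, arrows=True)
plt.axis("off")
plt.savefig("alert_group_bayesian_network.png")
```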

Challenges: Knowledge, Labels, Automation and Machine Learning

Building this operational practice requires a project team (with internal and external system SMEs) to discuss which metrics should be used to set up alerts and, if those alerts fire, how likely they are to lead to system failure. It is also recommended to schedule a workshop roughly once every quarter or half-year to review the model's performance and decide whether any new alerts should be added or existing ones excluded.

Accurate labeling of Prometheus alerts is also important. As we all know: garbage in, garbage out.
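As a simple guard, a sketch like the following could verify that alert payloads (as returned by the Alertmanager API) carry the labels the grouping logic depends on. The required label set is an assumption; adjust it to your own rules.

```python
# A small validation sketch: flag alerts missing the labels the data map relies on.
REQUIRED_LABELS = {"alertname", "namespace", "instance", "severity"}  # assumption


def find_mislabeled_alerts(alerts):
    """Return alerts missing any label that the grouping and data map rely on."""
    return [a for a in alerts if not REQUIRED_LABELS.issubset(a.get("labels", {}))]


# Example: the second alert is missing its namespace label.
alerts = [
    {"labels": {"alertname": "MysqlSlowQueries", "namespace": "db",
                "instance": "mysql-0", "severity": "warning"}},
    {"labels": {"alertname": "HostOutOfMemory", "instance": "node-3",
                "severity": "critical"}},
]
print(len(find_mislabeled_alerts(alerts)))  # 1
```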

This complex operation would not be possible without automation. It is not difficult to collect Prometheus alerts every 30 seconds, one minute or five minutes with a scheduler such as crontab. The harder part is creating a data map for each alert group and interfacing it with the Bayesian network model to evaluate the possibility of an anomaly. Open source Bayesian network implementations written in Python are readily available; pgmpy is one of the best-known Bayesian network libraries.
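Tying the pieces together, a scheduled job could look roughly like the sketch below. It assumes the earlier sketches live in hypothetical modules named alert_collector (fetch_active_alerts, group_states) and anomaly_model (the pgmpy inference object); those module and function names are placeholders, not part of any real package.

```python
# A sketch of the scheduled evaluation job. The imported modules are hypothetical
# wrappers around the earlier sketches in this article.
from alert_collector import fetch_active_alerts, group_states
from anomaly_model import inference


def evaluate_anomaly() -> float:
    """Fetch firing alerts, map them to group states and return P(SystemAnomaly = 1)."""
    firing = {a.get("labels", {}).get("alertname", "unknown") for a in fetch_active_alerts()}
    states = group_states(firing)
    evidence = {
        "SystemResources": states["system_resources"],
        "MySQLInternal": states["mysql_internal"],
    }
    result = inference.query(["SystemAnomaly"], evidence=evidence)
    return float(result.values[1])  # index 1 is the "anomaly present" state


if __name__ == "__main__":
    # Run from crontab, e.g. every minute: * * * * * python evaluate_anomaly.py
    print(f"P(anomaly) = {evaluate_anomaly():.2f}")
```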

It may be possible to improve the model's performance by using machine learning to track real operations continuously. DevOps teams could track whether anomaly detection is accurate and feed back the root cause(s) of incidents by integrating, for example, Jira and PagerDuty.
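As one possible way to close that feedback loop, pgmpy can re-estimate the conditional probabilities from labeled history. The DataFrame below is entirely hypothetical; in practice its rows would be reconciled from Jira and PagerDuty tickets recording which alert-group states coincided with confirmed incidents.

```python
# A sketch of re-estimating CPDs from labeled operational history with pgmpy.
import pandas as pd
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.models import BayesianNetwork

# Hypothetical labeled history: observed alert-group states plus whether the
# on-call team confirmed an incident for that interval.
history = pd.DataFrame(
    {
        "SystemResources": [0, 0, 1, 1, 0, 1],
        "MySQLInternal": [0, 1, 0, 1, 0, 1],
        "SystemAnomaly": [0, 0, 0, 1, 0, 1],
    }
)

learned = BayesianNetwork(
    [("SystemResources", "SystemAnomaly"), ("MySQLInternal", "SystemAnomaly")]
)
learned.fit(history, estimator=MaximumLikelihoodEstimator)
print(learned.get_cpds("SystemAnomaly"))  # CPD estimated from operational history
```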


