Part 2 Machine Learning

The following Blog Post is written by Brian Steele, our Technical Development Lead.

These days, “Machine Learning” and “Artificial Intelligence” are terms that are bandied around a great deal when vendors talk about the features of their Monitoring Solutions. But what is Machine Learning, and how does it help Monitoring Tools?

In our previous Blog we discussed “What is Machine Learning?”. Now we ask “How does Machine Learning help Monitoring Tools?”

Part 2 – How does Machine Learning help Monitoring Tools?

The Goal of Machine Learning in Monitoring

The goal of Machine Learning in monitoring is to reduce the administrative and diagnostic workload of engineers and help them get to the root cause of a problem faster.

This is achieved by deploying a Monitoring Solution that uses the following main initiatives augmented by Machine Learning to simplify and automate the root cause analysis process:

1. Correlate ALERTS - Identify alerts that correlate to a single root cause alert.

2. Relevant METRICS - Identify which metrics are relevant to diagnosing the root cause issue.

3. WHAT IS NORMAL - Learn what is “normal” for any given metric.

If the environment to be monitored is a simple one, then a single monitoring tool may be enough. In a more complex environment, multiple tools may be required to collect the different types of metrics needed to perform comprehensive root cause analysis. Either way, an MLMS (Machine Learning Monitoring Solution) can be implemented by deploying a single tool that has ML (Machine Learning) features or by integrating multiple tools into an AI-Ops tool.

Correlate Alerts

We are all used to getting “hundreds” of alerts for the same root cause issue! A Machine Learning Monitoring Solution can help with this problem through the following mechanisms:

1. SERVICE TOPOLOGY – The Machine Learning Monitoring Solution is configured with an understanding of service topology. In other words, it knows the parent/child relationships between configuration items. This understanding of service topology can be configured manually or learned from monitoring tools that have discovered the hierarchy. Because the solution can see the parent/child relationships, it knows that if a parent configuration item fails, there is little point in taking heed of the alerts coming from its child configuration items. It therefore mutes them or labels them as secondary to the root cause alert.

2. ALL ALERTS ALL THE TIME - During model training, Machine Learning looks at all alerts to see if there is any correlation between seemingly unrelated alerts. By monitoring all alerts all the time, Machine Learning may observe that two alerts always occur at roughly the same time, making them prime candidates for correlation. This observation can still hold even though the alerts don’t appear within the same service topology tree and therefore don’t have any parent/child relationship. By correlating alerts in this way, a Machine Learning Monitoring Solution can consolidate many alerts under a single root cause alert, reducing the volume of alerts. For example, four alerts (A01, A02, A03, and B01) could be consolidated under a single root cause alert, such as A02.
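The two mechanisms above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the topology map, the configuration item names, the 60 second window and the 0.8 ratio are all hypothetical, and simple co-occurrence counting stands in for whatever model a real Machine Learning Monitoring Solution trains.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical topology: child configuration item -> parent configuration item.
PARENT = {"vm-1": "host-1", "vm-2": "host-1", "app-1": "vm-1"}

def ancestors(ci, parent=PARENT):
    """Yield every ancestor of a configuration item, nearest first."""
    while ci in parent:
        ci = parent[ci]
        yield ci

def split_root_and_secondary(alerting_cis, parent=PARENT):
    """Label an alert as secondary if any of its ancestors is also alerting."""
    alerting = set(alerting_cis)
    root, secondary = [], []
    for ci in alerting_cis:
        if any(a in alerting for a in ancestors(ci, parent)):
            secondary.append(ci)
        else:
            root.append(ci)
    return root, secondary

def co_occurring_pairs(events, window=60, min_ratio=0.8):
    """Find alert pairs that usually fire within `window` seconds of each other.

    `events` is a list of (timestamp_seconds, alert_id) tuples.
    """
    times = defaultdict(list)
    for ts, alert_id in events:
        times[alert_id].append(ts)
    pairs = []
    for a, b in combinations(sorted(times), 2):
        # Count firings of `a` that have a firing of `b` close by.
        hits = sum(any(abs(ta - tb) <= window for tb in times[b]) for ta in times[a])
        ratio = hits / max(len(times[a]), len(times[b]))
        if ratio >= min_ratio:
            pairs.append((a, b, ratio))
    return pairs
```

The first function needs the topology to be known in advance, while the second needs only the alert history, which is why it can link alerts that sit in different topology trees.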

Relevant Metrics

Consider a critical application that is running slowly. The application is hosted on a single server, so only the metrics from that server are being monitored to determine the health of the application. However, the application is particularly sensitive to the performance of its disk storage, which is hosted on a SAN that in turn serves multiple applications across the wider landscape. Taking all of this into account, it becomes obvious that the availability and performance of the application are dependent on much more than just its hosting server.

Machine Learning will determine which metrics are relevant to the availability and performance of an application as part of feature selection in the model training phase. In this way, a Machine Learning Monitoring Solution will, free from the restriction of human assumptions, have a much more open and comprehensive determination of service availability and performance. This is a massive help when it comes to predicting potential problems (before they manifest themselves) and performing root cause analysis.
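As a rough illustration of the idea, the sketch below ranks candidate metrics by how strongly they move with application response time. Plain Pearson correlation is used here as a simple stand-in for what a trained model would learn, and the metric names and sample values are invented for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no signal
    return cov / sqrt(var_x * var_y)

def rank_metrics(target, candidates):
    """Rank candidate metrics by absolute correlation with the target."""
    scores = {name: abs(pearson(series, target)) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical samples taken at the same five points in time.
response_ms = [120, 135, 410, 125, 390]
candidates = {
    "server_cpu_pct": [35, 50, 40, 45, 30],
    "san_latency_ms": [2, 3, 19, 2, 17],
    "server_mem_pct": [61, 61, 61, 61, 61],
}
ranking = rank_metrics(response_ms, candidates)
```

With these made-up numbers the SAN latency comes out on top, ahead of the metrics from the hosting server itself, which is the situation described in the example above.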

What is Normal

For a traditional monitoring tool, the engineer is required to select an appropriate threshold for warning and critical alerts against each metric being monitored. The problem with this approach is that an appropriate threshold for one configuration item may be completely different for another. What tends to happen, therefore, is that engineers accept the default system-wide thresholds set by the monitoring tool vendor. This gets you by but is inappropriate as a long-term solution because it leads to some conditions not generating alerts while others generate too many.

A Machine Learning Monitoring Solution will learn an appropriate threshold for any metric based on its history during normal behaviour. It will also consider the time of day, or the day of the week or month, to cater for regular processes such as backups, which are expected to have an impact. This is a great help in keeping the Monitoring Solution sufficiently sensitive to issues while reducing or eliminating false alerts.
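A minimal sketch of learning what is “normal” might look like the following. It assumes a very simple statistical baseline (mean plus three standard deviations per hour of day) rather than a trained model, and the sample history is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def learn_baseline(history, k=3.0):
    """Learn an upper threshold per hour of day as mean + k * standard deviation.

    `history` is a list of (hour_of_day, value) samples from normal operation.
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: mean(vals) + k * pstdev(vals) for hour, vals in by_hour.items()}

def is_anomalous(baseline, hour, value):
    """Flag a value that exceeds the learned threshold for that hour."""
    threshold = baseline.get(hour)
    return threshold is not None and value > threshold

# Hypothetical disk I/O history: a backup at 02:00 makes high values normal then.
history = [(2, 900), (2, 950), (2, 920), (14, 100), (14, 120), (14, 110)]
baseline = learn_baseline(history)
```

Here a reading of 940 at 02:00 falls within the learned threshold because the backup makes it normal, whereas the same reading at 14:00 would be flagged. A single system-wide threshold could not make that distinction.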

In Conclusion

You don’t have to configure what data to monitor; Machine Learning will work that out for you.
You don’t have to configure rules and thresholds; Machine Learning will work that out for you.
Machine Learning will predict results even for metric values that have never occurred before.
The predictions improve over time as the Machine Learning observes more edge cases.

If you have found this introduction interesting, then download our mini-presentation here to get more insights into Machine Learning, including the fundamentals of Neural Networks.
