In the previous article of this series we looked at event clustering, one of the three key elements of ARCANNA.
Let’s recap the three modules of ARCANNA that play an important role in the final output:
- Clustering the events
- Probable root cause determination
- User-provided feedback
In this article we will take a closer look at the second of these three modules: probable root cause determination.
What is probable root cause determination?
Root cause determination is a process of analysis in which we try to discover the main source of a problem. In IT, when something is not working properly, several other devices and applications are usually affected, as well as users. As a result, when alerts begin to appear, they may all share a common element that is causing them to trigger.
Traditionally, root cause analysis requires cooperation between several IT teams, who check each alert and determine which one points to the main source of the issue. The most difficult and tedious task is separating the false positives (alerts generated by connected events, such as users having no internet access because of a faulty switch) from the real alerts. Once the issue is identified, the support team can fix it and the alerts should begin to disappear.
As you can see (and have most likely experienced), this is a difficult process that requires a lot of resources and takes a long time.
How it works
In ARCANNA’s case, the process of root cause determination is automated using a neural network.
As alerts begin to trigger, they are captured and sent to the Elastic Stack, where they are stored and analyzed. During the analysis, some events are enriched with additional information in the form of extra fields. After enrichment, the events pass through a neural network one at a time to determine which of them is the root cause of the issue. As the neural network analyzes the alerts, some fields are given a higher weight (that is, more relevance in the analysis). The weights of the fields are determined by cross-examining the events against the infrastructure topology and the dependencies between devices, applications and services.
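To make the enrichment step more concrete, here is a minimal Python sketch of how an event might receive extra fields derived from the infrastructure topology. It is only an illustration: the `TOPOLOGY` mapping, the `enrich` function and the field names `source`, `upstream` and `dependent_count` are all assumptions made for this example, not ARCANNA's actual schema or implementation.

```python
# Hypothetical topology: each device maps to the device it depends on.
TOPOLOGY = {
    "laptop-17": "switch-3",
    "laptop-22": "switch-3",
    "switch-3": "core-router",
    "core-router": None,
}


def enrich(event, topology):
    """Return a copy of the event with extra fields derived from the topology."""
    source = event["source"]
    # Devices that depend directly on the device that raised the event.
    dependents = [dev for dev, parent in topology.items() if parent == source]
    enriched = dict(event)  # leave the original event untouched
    enriched["upstream"] = topology.get(source)
    enriched["dependent_count"] = len(dependents)
    return enriched


events = [
    {"source": "laptop-17", "message": "no internet"},
    {"source": "switch-3", "message": "port errors"},
]
enriched_events = [enrich(e, TOPOLOGY) for e in events]
```

In a sketch like this, a field such as `dependent_count` is the kind of extra information that could deserve a higher weight, since a device with many dependents is more likely to sit at the origin of a wave of alerts.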
The end result is a series of events which are tagged either as a probable root cause or a symptom of the overall problem.
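The shape of that end result can be sketched in a few lines of Python. Note that this example deliberately swaps the neural network for a much simpler rule based on a dependency map: an event is tagged as a symptom if any device upstream of its source is also raising alerts, and as a probable root cause otherwise. The `DEPENDS_ON` mapping, the `tag_events` function and the tag names are assumptions made for illustration, not ARCANNA's actual logic.

```python
# Hypothetical dependency map: device -> the device it depends on.
DEPENDS_ON = {
    "crm-app": "db-server",
    "db-server": "switch-3",
    "laptop-17": "switch-3",
    "switch-3": None,
}


def tag_events(events, depends_on):
    """Tag each event as 'symptom' or 'probable_root_cause'.

    Simple heuristic (not a neural network): an event is a symptom if
    any device upstream of its source is also raising alerts.
    """
    alerting = {event["source"] for event in events}
    tagged = []
    for event in events:
        tag = "probable_root_cause"
        node = depends_on.get(event["source"])
        seen = set()  # guards against cycles in the dependency map
        while node is not None and node not in seen:
            if node in alerting:
                tag = "symptom"
                break
            seen.add(node)
            node = depends_on.get(node)
        tagged.append({**event, "tag": tag})
    return tagged


alerts = [
    {"source": "crm-app", "message": "login failed"},
    {"source": "laptop-17", "message": "high latency"},
    {"source": "switch-3", "message": "port errors"},
]
tagged_alerts = tag_events(alerts, DEPENDS_ON)
```

Here the two alerts from `crm-app` and `laptop-17` end up tagged as symptoms, while the alert from `switch-3`, which has no alerting device upstream, is tagged as the probable root cause.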
Let’s look at an example.
Our sales team, which uses a CRM application hosted on the organization’s private cloud, can no longer log in.
As data starts streaming into the Elastic Stack in the form of alerts, unsuccessful login attempts, high latency, error codes and so on, ARCANNA begins to analyze these events to determine which characteristics can be used to create clusters of events.
Once the clusters are determined, each event is assigned to a cluster based on its characteristics. The characteristics themselves are determined based on the events that come in. If several events are related to computers running Microsoft Windows, for example, a cluster for Microsoft Windows will be created. If several high-latency alerts are reported from multiple network devices, a cluster for those network devices will be created as well.
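The idea that clusters emerge from the characteristics shared by incoming events can be illustrated with a short Python sketch. It is a simplified grouping by field value, not ARCANNA's clustering algorithm, and the field names (`os`, `device_type`, `symptom`), the function name and the `min_size` threshold are assumptions made for this example.

```python
from collections import defaultdict


def cluster_by_characteristics(events, fields, min_size=2):
    """Group event ids by shared field values.

    A (field, value) pair becomes a cluster only when at least
    `min_size` events share it.
    """
    groups = defaultdict(list)
    for event in events:
        for field in fields:
            value = event.get(field)
            if value is not None:
                groups[(field, value)].append(event["id"])
    return {key: ids for key, ids in groups.items() if len(ids) >= min_size}


incoming = [
    {"id": 1, "os": "Windows", "symptom": "login_failure"},
    {"id": 2, "os": "Windows", "symptom": "high_latency"},
    {"id": 3, "device_type": "switch", "symptom": "high_latency"},
    {"id": 4, "device_type": "switch", "symptom": "high_latency"},
    {"id": 5, "os": "Linux", "symptom": "error_code"},
]
clusters = cluster_by_characteristics(incoming, ("os", "device_type", "symptom"))
```

With this sample data, clusters form for Windows machines, for switches and for high latency, while the single Linux event does not form a cluster of its own. An event can belong to more than one cluster in this sketch, which is a simplification chosen for brevity.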
What are the advantages?
The biggest advantage of ARCANNA when it comes to determining the probable root cause of issues is that it is trained directly in your environment.
Out-of-the-box solutions rely heavily on global data from various sources and multiple environments to train their machine learning models and neural networks, which might not always produce accurate results for your setup. ARCANNA is trained directly on your infrastructure, which allows it to adapt to the particularities of your environment without being influenced by outside sources and to become more accurate with each iteration.
As a result, as ARCANNA analyzes issues, certain events become more likely over time to be flagged as a probable root cause because of their past behavior. Additionally, if the topology of the infrastructure changes, the model can easily adapt so that accuracy does not drop.