ARCANNA (Automated Root Cause Analysis Neural Network Assisted) is an open-source Elasticsearch plugin that automatically detects the root cause of an issue across the entire infrastructure stack by using supervised machine learning.
ARCANNA is made up of three modules, each playing a critical role in the final output:
- Clustering the events
- Determining the probable root cause
- User-provided feedback
In this article we take a closer look at the first of these three modules: event clustering.
What clustering is
Event clustering is a technique used in machine learning and data mining to group together events that have similar characteristics. More specifically, a clustering algorithm determines a number of clusters and assigns the available data to them based on the similarities between data points.
To do this, we use the infrastructure topology and the dependencies between devices and applications, in addition to the events' characteristics, to determine which cluster an event should go into.
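To make the role of topology concrete, here is a minimal sketch, assuming a hand-written dependency map. The entity names, the `DEPENDS_ON` map and the `cluster_by_topology` function are hypothetical, and the connected-components approach is a stand-in for illustration, not ARCANNA's actual algorithm. It groups alerting entities that are linked to each other through dependencies.

```python
from collections import defaultdict

# Hypothetical dependency map: each entity lists the entities it depends on.
DEPENDS_ON = {
    "crm-app": ["db-01", "lb-01"],
    "db-01": ["switch-a"],
    "lb-01": ["switch-a"],
    "mail-app": ["switch-b"],
}

def build_neighbors(depends_on):
    """Turn the directed dependency map into an undirected adjacency map."""
    neighbors = defaultdict(set)
    for entity, deps in depends_on.items():
        for dep in deps:
            neighbors[entity].add(dep)
            neighbors[dep].add(entity)
    return neighbors

def cluster_by_topology(alerting_entities, depends_on):
    """Group alerting entities connected through other alerting entities."""
    neighbors = build_neighbors(depends_on)
    alerting = set(alerting_entities)
    seen, clusters = set(), []
    for start in sorted(alerting):
        if start in seen:
            continue
        cluster, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            cluster.append(node)
            # Only follow links to entities that are themselves alerting.
            for nxt in neighbors[node] & alerting:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(cluster))
    return clusters

clusters = cluster_by_topology(
    ["crm-app", "db-01", "switch-a", "mail-app"], DEPENDS_ON
)
print(clusters)
```

In this toy topology the CRM application, its database and the switch below it end up in one group, while the unrelated mail application stays on its own.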
How it works
In ARCANNA’s case the theory of clustering still holds; only the terms map to different things:
- The data points are the alerts generated by the infrastructure
- The clusters are represented by groups of devices, applications or services
- The characteristics by which the clusters are formed are things like role (switch, server, application), positioning (geographically or in the infrastructure topology), vendor or any other characteristic that the algorithm determines to be valid
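The mapping above can be sketched as a simple grouping step. This is an illustration only: the alert fields (`role`, `site`, `vendor`) and the `group_alerts` helper are assumed names, and in ARCANNA the characteristics are chosen by the algorithm rather than passed in by hand.

```python
from collections import defaultdict

# Hypothetical alerts; the field names are assumptions made for this sketch.
alerts = [
    {"id": 1, "host": "sw-01", "role": "switch", "site": "site-1", "vendor": "vendor-a"},
    {"id": 2, "host": "sw-02", "role": "switch", "site": "site-1", "vendor": "vendor-a"},
    {"id": 3, "host": "srv-07", "role": "server", "site": "site-1", "vendor": "vendor-b"},
    {"id": 4, "host": "sw-09", "role": "switch", "site": "site-2", "vendor": "vendor-a"},
]

def group_alerts(alerts, characteristics):
    """Group alert ids by the values of the chosen characteristics."""
    groups = defaultdict(list)
    for alert in alerts:
        # The cluster key is the tuple of characteristic values for this alert.
        key = tuple(alert.get(name) for name in characteristics)
        groups[key].append(alert["id"])
    return dict(groups)

by_role_and_site = group_alerts(alerts, ["role", "site"])
for key, ids in by_role_and_site.items():
    print(key, ids)
```

Changing the list of characteristics changes the clusters: grouping by `vendor` alone, for example, would put alerts 1, 2 and 4 together.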
As data is processed, the alerts are enriched based on topology correlation: new fields are added with additional information such as the host, the error code and so on. These fields also indicate which cluster of entities the alert belongs to, as well as its hierarchical position in the topology.
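A minimal sketch of this kind of enrichment, assuming the topology is a simple acyclic parent/child tree, could look as follows. The field names (`cluster`, `topology_path`, `topology_depth`), the lookup tables and the `enrich` function are hypothetical and do not reflect the fields ARCANNA actually writes.

```python
# Hypothetical topology: each entity points to its parent (None for the root).
PARENT = {
    "crm-app": "vm-12",
    "vm-12": "hv-03",
    "hv-03": "switch-a",
    "switch-a": None,
}

# Hypothetical cluster membership, as produced by an earlier clustering step.
CLUSTER_OF = {"crm-app": "crm-stack", "vm-12": "crm-stack", "hv-03": "compute"}

def topology_path(entity, parent):
    """Return the chain from the root of the topology down to the entity."""
    path = [entity]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return list(reversed(path))

def enrich(alert, parent, cluster_of):
    """Return a copy of the alert with cluster and hierarchy fields added."""
    path = topology_path(alert["host"], parent)
    enriched = dict(alert)  # leave the original alert untouched
    enriched["cluster"] = cluster_of.get(alert["host"], "unclustered")
    enriched["topology_path"] = "/".join(path)
    enriched["topology_depth"] = len(path) - 1  # the root has depth 0
    return enriched

enriched_alert = enrich({"host": "crm-app", "error_code": "503"}, PARENT, CLUSTER_OF)
print(enriched_alert)
```

Storing the full path alongside the depth makes it possible to filter alerts both by cluster and by where they sit in the hierarchy.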
Let’s look at an example.
Our sales team, which uses a CRM application hosted on the organization’s private cloud, can no longer log in.
As data starts streaming into the Elastic Stack in the form of alerts, unsuccessful login attempts, high latency, error codes and so on, ARCANNA begins to analyze these events to determine which characteristics can be used to create clusters of events.
Once the clusters are defined, the events are enriched with additional fields that describe the cluster they are part of and their position in the hierarchy of the topology. The characteristics themselves are determined based on the events that come in. For example, if several events are related to computers running Microsoft Windows 10, those computers will most likely be clustered together. Similarly, if several high-latency alerts are reported from multiple network devices, those alerts will most likely be clustered together as well.
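The idea that characteristics emerge from the incoming events can be sketched with a simple frequency count. This is an assumption-laden illustration: the event fields, the `min_events` threshold and both helper functions are hypothetical, and counting shared values is a much simpler technique than whatever ARCANNA applies internally.

```python
from collections import Counter

# Hypothetical events for the CRM example; the field names are assumptions.
events = [
    {"id": "e1", "os": "windows-10", "symptom": "login-failed"},
    {"id": "e2", "os": "windows-10", "symptom": "login-failed"},
    {"id": "e3", "os": "windows-10", "symptom": "login-failed"},
    {"id": "e4", "device_type": "switch", "symptom": "high-latency"},
    {"id": "e5", "device_type": "router", "symptom": "high-latency"},
]

def shared_characteristics(events, min_events=2, ignore=("id",)):
    """Count (field, value) pairs and keep those seen in enough events."""
    counts = Counter(
        (field, value)
        for event in events
        for field, value in event.items()
        if field not in ignore
    )
    return {pair: n for pair, n in counts.items() if n >= min_events}

def candidate_clusters(events, min_events=2):
    """Map each shared characteristic to the ids of the events that have it."""
    shared = shared_characteristics(events, min_events)
    return {
        (field, value): [e["id"] for e in events if e.get(field) == value]
        for field, value in shared
    }

candidates = candidate_clusters(events)
for characteristic, ids in candidates.items():
    print(characteristic, ids)
```

Here the Windows 10 machines and the high-latency network devices each form a candidate cluster, while a value seen only once, such as a single device type, does not.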
Why use clustering though?
Event clustering is a simple but very efficient solution to a big problem that IT teams face today. With infrastructures growing both horizontally (with the addition of new devices) and vertically (with the addition of new technologies such as containers), it is becoming increasingly difficult to determine which alerts are related to which problem.
Similarly, because so many devices, applications and services are interconnected, when one fails several others begin to crash, resulting in overlapping alerts and notifications. This means a lot of noise that has to be dealt with and a lot of time spent finding the underlying cause.
By clustering the events we can begin to see which events are more likely to be related to our problem, essentially reducing the number of false-positive alerts that we have to look at. Going back to our example, it is much simpler to check only the users’ computers running Windows 10, or only the application, than to engage several teams in order to determine what needs to be fixed.
In part II of our article we will look at how ARCANNA goes further down the rabbit hole to determine the probable root cause of issues with the help of neural networks.