The IT landscape is more dynamic than it has ever been, with new technologies and applications continuously being added to an already complex environment. IT infrastructure and operations teams find themselves in a continuous cycle of integrating new devices, services, and applications, all of which interconnect and depend on each other to function properly. From a monitoring perspective, this leads to numerous alerts and false positives when a single device that multiple other devices rely on fails, and determining the root cause of the issue becomes a long and daunting process.
Many engineers have found themselves on endless bridge calls with other teams (some of which shouldn’t even have to be there), trying to determine why a particular site lost network connectivity or why users can’t access a business-critical application. These are the types of issues that users expect to be solved within hours, if not minutes, considering the impact they have on day-to-day business operations. But the fact of the matter is that the cause is not always the application itself, or a faulty switch or router.
Considering how many other devices stand between the user and the applications and services, determining which one caused the issue is a painstakingly long task. It involves logging in to multiple servers, routers, and switches, and checking containers, user access, application logs, and so on, all of which are usually handled by different teams. This frustrates not only the users but the engineers as well, because they feel that they have little control over the infrastructure they manage.
But we live in wonderful times, in which machine learning is no longer out of reach for IT teams and its concepts can be applied to areas such as infrastructure monitoring, security analytics, and application monitoring. By combining these technologies with big data and process automation, we can start to determine the root cause of an issue in an automated fashion. This translates to less time spent on bridge calls, correct allocation of tickets to the responsible team, fewer false positives, reduced MTTR (Mean Time To Resolution), an improved user experience, and less stress for the engineers.
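To make the idea of automated root cause determination a little more concrete, here is a minimal sketch in Python. It is not the machine learning approach itself, only a simple rule-based illustration of the underlying principle: if a device is alerting and something it depends on is also alerting, the downstream alert is probably a symptom rather than the cause. The device names, the `depends_on` map, and both function names are hypothetical and chosen purely for illustration; a real environment would pull the dependency data from a CMDB or discovery tool.

```python
from typing import Dict, List, Set


def upstream_of(node: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every device that `node` relies on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(depends_on.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also protects against cycles
        seen.add(current)
        stack.extend(depends_on.get(current, []))
    seen.discard(node)  # a device is never its own upstream dependency
    return seen


def probable_root_causes(alerting: Set[str],
                         depends_on: Dict[str, List[str]]) -> Set[str]:
    """Keep only alerting devices that have no alerting upstream dependency."""
    return {
        node for node in alerting
        if not (upstream_of(node, depends_on) & alerting)
    }


# Hypothetical topology: each device maps to the devices it relies on.
depends_on = {
    "crm-app": ["app-server-1"],
    "app-server-1": ["access-switch-3"],
    "db-server-1": ["access-switch-3"],
    "access-switch-3": ["core-router-1"],
}

# Hypothetical set of devices currently raising alerts.
alerting = {"crm-app", "app-server-1", "db-server-1", "access-switch-3"}

print(sorted(probable_root_causes(alerting, depends_on)))
# prints ['access-switch-3']
```

In this example four alerts collapse into one probable root cause, so a single ticket could go to the network team instead of four tickets to three different teams. A static rule like this assumes the dependency map is complete and accurate, which is rarely the case in practice, and that gap is one place where data-driven techniques can help.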
In part two of this article, I will explain how to implement automated root cause analysis using open-source technologies.